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[57] ABSTRACT 

A document stream operating system and method is disclosed in which: (1) documents are stored in one or more chronologically ordered streams; (2) the location and nature of file storage is transparent to the user; (3) information is organized as needed instead of at the time the document is created; (4) sophisticated logic is provided for summarizing a large group of related documents at the time a user wants a concise overview; and (5) archiving is automatic. The documents can include text, pictures, animations, software programs or any other type of data.

DOCUMENT STREAM OPERATING SYSTEM

FIELD OF THE INVENTION

The present invention relates to an operating system in which documents are stored in a chronologically ordered "stream". In other words, as each document is presented to the operating system, the document is placed according to a time indicator in the sequence of documents already stored relative to the time indicators of the stored documents.

Within this application several publications are referenced by arabic numerals within parentheses. Full citations for these and other references may be found at the end of the specification immediately preceding the claims. The disclosures of all of these publications in their entireties are hereby incorporated by reference into this application in order to more fully describe the state of the art to which this invention pertains.

BACKGROUND OF THE INVENTION

Conventional operating systems frequently confuse inexperienced users because conventional operating systems are not well suited to the needs of most users. For example, conventional operating systems utilize separate applications which require file and format translations. In addition, conventional operating systems require the user to invent pointless names for files and to construct organizational hierarchies that quickly become obsolete. Named files are an invention of the 1950s and hierarchical directories are an invention of the 1960s.

Some conventional operating systems employ a "desktop metaphor" which attempts to simplify common file operations by presenting the operations in the familiar language of the paper-based world, that is, paper documents as files, folders as directories, a trashcan for deletion, etc. Also, the paper-based model is a rather poor basis for organizing information where the state of the art is still a messy desktop and where one's choices in creating new information paradigms are constrained [1].

Thus, conventional operating systems suffer from at least the following disadvantages: (1) a file must be "named" when created and often a location in which to store the file must be indicated resulting in unneeded overhead; (2) users are required to store new information in fixed categories, that is directories or subdirectories, which are often an inadequate organizing device; (3) archiving is not automatic; (4) little support for "reminding" functions is provided; (5) accessibility and compatibility across data platforms is not provided and (6) the historical context of a document is lost because no tracking of where, why and how a document evolves is performed. "Naming" a file when created and choosing a location in which to place the file is unneeded overhead: when a person grabs a piece of paper and starts writing, no one demands that a name be bestowed on the sheet or that a storage location be found. Online, many filenames are not only pointless but useless for retrieval purposes. Storage locations are effective only as long as the user remembers them.

Data archiving is an area where conventional electronic systems perform poorly compared to paper-based systems. Paper-based systems are first and foremost archiving systems, yet data archiving is difficult in conventional desktop systems. Often, users throw out old data rather than undertaking the task of archiving and remembering how to get the data back. If archiving and retrieval of documents is convenient, old information could be reused more often.

Reminding is a critical function of computer-based systems [2] [3], yet current systems supply little or no support for this function. Users are forced either to use location on their graphical desktops as reminding cues or to use add-on applications such as calendar managers.

A solution to these disadvantages is to use a document stream operating system. One such system is outlined in a 1994 article [4]. However, this article fails to address many of the disadvantages of conventional operating systems.

One object of the present invention is to provide a document stream operating system and method which solves many, if not all, of the disadvantages of conventional operating systems.

Another object of the present invention is to provide a document stream operating system in which documents are stored in one or more chronologically ordered streams.

An additional object of the present invention is to provide an operating system in which the location and nature of file storage is transparent to the user, for example, the storage of files is handled automatically and file names are only used if a user chooses to invent such names.

A further object of the present invention is to provide an operating system which takes advantage of the nature of electronic documents. For example, a conventional paper document can only be accessed in one place, but electronic documents can be accessed from multiple locations.

Another object of the present invention is to organize information as needed instead of at the time the document is created. For example, documents may belong to as many streams as seems reasonable or to none.

An additional object of the present invention is to provide an operating system in which archiving is automatic.

A further object of the present invention is to provide an operating system with sophisticated logic for summarizing or compressing a large group of related documents when the user wants a concise overview. In addition, this summarizing can include pictures, sounds and/or animations. Also, no matter how many documents fall into a given category, the operating system is capable of presenting an overview in a form so that all the documents are accessible from a single screen.

Also, an object of the present operating system is to make "reminding" convenient.

Another object of the present invention is to provide an operating system in which personal data is widely accessible anywhere and compatibility across platforms is automatic. Accordingly, this invention provides that computers using the operating system of the present invention need not be independent data storage devices, but also act as "viewpoints" to data stored and maintained on external systems such as the INTERNET. Thus, in accordance with the present invention users can access their personal document streams from any available platform such as a UNIX machine, a Macintosh or IBM-compatible personal computer, a personal digital assistant (PDA), or a set-top box via cable.

According to one embodiment of the invention a computer program for organizing one or more data units is provided. The computer program includes: (1) means for receiving one or more of the data units, each of which is associated with one or more chronological indicators; and (2) means for linking each of the data units according to the chronological indicators to generate one or more streams of
data units. Other embodiments of the invention also provide: (1) chronological indicators including past, present, and future times; and (2) means for displaying the streams, wherein respective indicia representing the data units are displayed and each data unit includes textual data, video data, audio data and/or multimedia data. The means for displaying the streams may further include displaying selected segments of the streams corresponding to selected intervals of time. The means for receiving may further include means for receiving data units from the World Wide Web or from a client computer.

According to another embodiment of the invention, a method of organizing one or more data units is provided including the steps of: (1) receiving one or more data units, each of which is associated with one or more chronological indicators; and (2) linking each of the data units according to the chronological indicators to generate one or more streams of data units. In other embodiments, the chronological indicators may include past, present, and future times. The method may further include the steps of: (1) displaying the streams, wherein respective indicia represent each data unit and each of the data units may be textual data, video data, audio data and/or multimedia data. The step of displaying the streams may further include the steps of: (1) receiving from a user one or more values indicative of one or more selected segments of the streams corresponding to selected intervals of time; and (2) displaying the segments of the streams corresponding to the selected intervals of time.

These and other advantages of the present invention will become apparent from the detailed description accompanying the claims and the attached figures.

DESCRIPTION OF THE DRAWINGS

FIG. 1 shows a viewport in one embodiment of the present invention;

FIG. 2 shows a substream menu in one embodiment of the present invention;

FIG. 3 shows a list of summary types for the substream chosen in FIG. 2 of the present invention;

FIG. 4 shows the time display in one embodiment of the present invention;

FIG. 5 shows a calendar-based dialog box in one embodiment of the present invention;

FIG. 6a shows a dialog box in connection with a phone call in one embodiment of the present invention;

FIG. 6b shows a summary of phone calls in one embodiment of the present invention;

FIG. 7 shows a phone call record dialog box in one embodiment of the present invention;

FIG. 8a shows text data used by one embodiment of the present invention; and

FIG. 8b shows the result of a summarize operation in one embodiment of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

This invention is a new model and system for managing personal electronic information which uses a time-ordered stream as a storage model and stream filters to organize, locate, summarize and monitor incoming information. Together, streams and filters provide a unified framework that subsumes many separate desktop applications to accomplish and handle personal communication, scheduling, and search and retrieval tasks. One embodiment of the present invention utilizes a machine-independent client/server open architecture so that users can continue to use the same conventional document types, viewers and editors.

A "stream" according to the present invention is a time-ordered sequence of documents that functions as a diary of a person or an entity's electronic life. Every document created and every document sent to a person or entity is stored in a main stream. The tail of a stream contains documents from the past, for example starting with an electronic birth certificate or articles of incorporation. Moving away from the tail and toward the present and future, that is, toward the head of the stream, more recent documents are found including papers in progress or new electronic mail. A document can contain any type of data including but not limited to pictures, correspondence, bills, movies, voice mail and software programs. Moving beyond the present and into the future, the stream contains documents allotted to future times and events, such as reminders, calendar items and to-do lists. Time-based ordering is a natural guide to experience. Time is the attribute that comes closest to a universal skeleton-key for stored experience. Accordingly, streams add historical context to a document collection with all documents eventually becoming read-only, analogously as history becomes "set in stone". The stream preserves the order and method of document creation. Also, like a diary, a stream records evolving work, correspondence and transactions because historical context can be crucial in an organizational setting.

One embodiment of this invention allows for basic operations to be performed on a stream: new, clone, transfer, find and summarize.

Users create documents by means of the new and clone operations. New creates a new, empty document and adds the document to the main stream. Clone duplicates an existing document and adds the duplicate to the main stream at a new time point. Documents can also be created indirectly through the transfer operation. The transfer operation copies a document from one stream to another stream. Creation of a document is "transparent" because documents, by default, are added to the stream at the present time point. Internally, the document is identified by a time indication so no name is required from the user for the document. Nevertheless, a user can optionally name a document if desired.

Some streams can be organized on the fly with the find operation. Find prompts for a search query, such as "all E-mail I haven't responded to," or "all faxes I've sent to Schwartz" and creates a substream. Substreams, unlike conventional, virtual or fixed directories which only list filenames, present the user with a stream "view" of a document collection. This view, according to the present invention, contains all documents that are relevant to the search query. Also, unlike searches of conventional fixed directories, the substream is generated by default from all the documents in the main stream. Accordingly, individual substreams may overlap, that is, contain some documents that are the same, and can be created and destroyed on the fly without affecting the main stream or other substreams.

The find operation creates a substream of the main stream or of another substream based on, for example, a boolean attribute-and-keyword expression or a 'chronological expression', for example, "my last letter to Schwartz". Also, substreams may point to the future, for example, "my next appointment".

Once created, substreams operate dynamically, that is, if
a user allows a substream to persist, the substream will
collect new documents that match the search criteria as 
documents arrive from outside the operating system or as the 
user creates the document. This dynamic operation provides 
automatic monitoring of information because the substream 
not only organizes the documents as received but also filters 
for incoming information. For example, a substream created 
with the query "find all documents created by other people" 
would subsume a user's mailbox and automatically collect 
all arriving mail from other users. A substream remains in 
existence until destroyed by the user and acts as a filter by 
examining each new document that enters the main stream. 
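
By way of a non-limiting illustration, the behavior of a main stream and a persistent substream described above can be sketched in Python. The class and method names (`Document`, `Stream`, `Substream`, `add`, `find`, `destroy`) are assumptions chosen for this sketch and do not appear in the specification; the sketch keeps every document on the main stream in time order and lets each surviving substream filter new arrivals.

```python
from bisect import insort
from dataclasses import dataclass, field
from datetime import datetime
from typing import Callable, List


@dataclass(order=True)
class Document:
    # Only the chronological indicator takes part in ordering.
    timestamp: datetime
    body: str = field(compare=False)
    author: str = field(compare=False, default="me")


class Substream:
    """A live, filtered view of a stream (hypothetical name)."""

    def __init__(self, predicate: Callable[[Document], bool]) -> None:
        self.predicate = predicate
        self.documents: List[Document] = []

    def offer(self, doc: Document) -> None:
        # Accept the document only if it matches the search criteria.
        if self.predicate(doc):
            insort(self.documents, doc)


class Stream:
    """A time-ordered main stream with persistent substreams."""

    def __init__(self) -> None:
        self.documents: List[Document] = []
        self.substreams: List[Substream] = []

    def add(self, doc: Document) -> None:
        # Every document enters and remains on the main stream.
        insort(self.documents, doc)
        # Each persisting substream examines the new arrival.
        for sub in self.substreams:
            sub.offer(doc)

    def find(self, predicate: Callable[[Document], bool]) -> Substream:
        # Build the substream from all documents already on the stream.
        sub = Substream(predicate)
        for doc in self.documents:
            sub.offer(doc)
        self.substreams.append(sub)
        return sub

    def destroy(self, sub: Substream) -> None:
        # Destroying a substream never affects the main stream.
        self.substreams.remove(sub)
```

In this sketch a query such as "find all documents created by other people" would be passed as `lambda d: d.author != "me"`, and the resulting substream would keep collecting matching documents until `destroy` is called.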

Although a document may belong to any number of
substreams, the document also enters and remains on the
main stream. A substream, in other words, is a "subset" of
the main stream document collection, that is, a way
of looking at the main stream so as to exclude certain
documents temporarily.

The summarize operation "compresses" or "squishes" a
stream to generate one or more overview documents. The
content of an overview document depends on the type of
documents in the stream. For instance, if the stream contains
the daily closing prices of all the stocks and mutual funds in
a user's investment portfolio, the overview documents may
contain a chart displaying the historical performance of
particular securities and the user's net worth. In another
example, if the stream contains a list of tasks a user needs
to complete, the overview document might display a
prioritized "to-do" list. Thus, the summarize operation collapses
a stream into a summary document. This summary
document is a "live" document which is updated as additional
documents are added to the main stream.
The type of summary depends on the type of documents
in the substream. In one embodiment of the present inven- 
tion at least one general "squish" function is provided no 
matter which stream is to be squished. Typically, however, 
the user will have a number of different squishers to choose 
from, for example, one squisher might produce a summary 
in words, while another squisher might produce a graph. 
Also "custom squishers" may be supplied by third parties or 
created by the user. 
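
As a hedged illustration only, the choice among several "squishers" with a general fallback can be sketched as a small registry. The names `SQUISHERS`, `register`, `summarize`, and the two example squishers are hypothetical and are not taken from the specification; the summaries they produce are deliberately trivial.

```python
from typing import Callable, Dict, Sequence

# A squisher turns the text of the documents in a stream into one summary.
Squisher = Callable[[Sequence[str]], str]

SQUISHERS: Dict[str, Squisher] = {}


def register(name: str) -> Callable[[Squisher], Squisher]:
    """Register a squisher under a name (custom squishers use this too)."""
    def decorator(fn: Squisher) -> Squisher:
        SQUISHERS[name] = fn
        return fn
    return decorator


@register("general")
def general_squish(docs: Sequence[str]) -> str:
    # General fallback: the first line of every document.
    return "\n".join(d.splitlines()[0] if d else "" for d in docs)


@register("word-count")
def word_count_squish(docs: Sequence[str]) -> str:
    # A summary in words about the size of the stream.
    total = sum(len(d.split()) for d in docs)
    return f"{len(docs)} documents, {total} words"


def summarize(docs: Sequence[str], kind: str = "general") -> str:
    # Fall back to the general squisher when the requested one is unknown.
    squisher = SQUISHERS.get(kind, SQUISHERS["general"])
    return squisher(docs)
```

A third party or the user could supply a custom squisher in this sketch simply by decorating another function with `register`.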

Another aspect of the invention is that applications 
execute "inside" a stream document leaving any output in 
that document. Thus, running an application is a variant of 
the new operation. For example, to run a spreadsheet
application such as Lotus 1-2-3, a user creates a new document
at the head of the main stream, specifically, a "live" spread- 
sheet document. The application itself is stored on the main 
stream, or located by means of a calling card that points to 
another stream containing the application. 

A stream has three main portions: past, present, and 
future. The "present" portion of the stream holds "working 
documents", and also includes the timepoint in the stream
where new documents are created and where incoming 
documents are placed. As documents age and newer docu- 
ments are added, older documents pass from the user's view 
and enter the "past portion" where the documents are 
eventually "archived". By disappearing from view old infor- 
mation is automatically cleared away so the old information 
will not clutter up the workspace. At some future point if 
documents in the past portion are needed, such documents 
can be located with the find operation even if the past 
document has already been archived.

The "future" portion of the stream allows documents to be
created in the future. Future creation is a natural method of
posting reminders, for example, dates and scheduling
information. The system allows users to dial to the
future by selecting a future timepoint for a document. The
present invention keeps the document until that future time
occurs. When the time of the document's timepoint arrives, the
reminder document is brought into view and the document
enters the present portion of the stream.

One embodiment of the present invention is implemented
in a client/server architecture running over the Internet. The 
server is the workhorse of this embodiment handling one or 
more streams by storing all main stream and substream 
documents. Each view of a stream is implemented as a client 
of the server and provides the user with a "viewpoint" 
interface to document collections, that is, streams. The "look 
and feel" of the viewport may be different for different 
computing platforms but each viewport should support the 
basic operations. 

One embodiment of the present invention implements a
client viewport using graphically based X Windows, another
embodiment implements a client viewport solely with text in
standard ASCII (American Standard Code for Information
Interchange) and yet another embodiment implements
a client viewport for the NEWTON personal digital
assistant (PDA). The X Windows viewport provides the full
range of functionalities including picture and movie display.
In contrast, the text-only viewport has a mail-like interface
although all the basic operations are available. Also, because
the NEWTON PDA lacks substantial internal memory and
relies on slow external communications, that is, low
bandwidth, a minimal stream-access method is provided.

The X Windows viewport embodiment is shown in FIG. 
1. The interface is based on a visual representation of the 
stream metaphor 5. Users can slide the mouse pointer 10 
over the document representations to "glance" at each 
document, or use the scroll bar 20 in the lower left-hand 
corner to move through time, either into the past or into the 
future portion of the stream. 

Color and animation indicate important document fea- 
tures. A red border in one embodiment means "unseen" and 
a blue border means "writable". Open documents may be 
offset to the side to indicate when the document is being 
edited. In this embodiment incoming documents slide in 
from the left side and newly created documents pop down 
from the top and push the stream backwards by one document
into the past. 

External applications are used to view and edit documents
which the user can select by clicking on the document's
graphical representation. The external applications speed the
learning process significantly because new users can
continue to use familiar applications, for example, conventional
UNIX applications such as emacs, xv, and ghostview, to
create and view documents while using streams to organize
and communicate the documents.

The X Windows interface prominently displays the basic
operations, that is, New 30, Clone 40, Xfer 50 (that is,
transfer), Find 60, and Summarize 70 as buttons and/or
menus. As discussed previously the New button creates a
new document and adds the document to the stream at the
"present" timepoint. The Clone button duplicates an existing
document and places the copy in the stream. The Xfer button
first prompts the user for one or more mail addresses and
then forwards the selected document. The find operation is
supported through a text entry box 60 that allows the user to
enter a boolean search query which results in a new
substream being created and displayed. The summarize menu
70 generates a new document which displays information
from documents in a stream in a desired format, for example,
a graph.

The X Windows interface of this embodiment also provides
additional buttons. The Print button 80 copies a selected 
document to a printer where documents may be either 
printed conventionally or moved to a printer stream. A 
software agent which can be associated with the stream 5 
forwards each new document to an appropriate printer. The 
Freeze button 90 makes a document read-only. 

Pulldown menus are used to select documents from
streams or existing substreams, create summaries, initiate
personal agents and change the clock.

The Streams menu 110 allows the user to select from a list 
of locally available streams. 

FIG. 2 shows the Substreams menu 120 of one embodi- 
ment of the present invention. This menu is divided into 
three sections. The first section 130 contains a list of 
operations that can be performed on substreams, for 
example, remove. The second section 140 contains one 
menu entry labeled "Your Lifestream", and causes the 
viewport to display a user's main stream. The third section 
150 lists all of the user's substreams. As indicated by this 
third section, substreams can be created in an incremental 
fashion, that is, one substream generated from another 
resulting in a nested set of menus. In this example the nested 
menus were created by first creating a substream
"lifestreams and david" 160 from the main stream and then
creating two substreams from this substream, "scenarios"
and "ben" 170. Substream "scott" 180 was created from the
"scenarios" substream. Semantically this incremental
substreaming amounts to a boolean 'and' of each new query
with the previous substream's query.
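
The boolean 'and' semantics of incremental substreaming can be illustrated with a minimal Python sketch. The helpers `keyword` and `refine` are hypothetical names, and a query is modelled here simply as a predicate over document text, which is an assumption of this sketch rather than the query representation of the embodiment.

```python
from typing import Callable

# A query is modelled as a predicate over the text of a document.
Query = Callable[[str], bool]


def keyword(word: str) -> Query:
    """Match documents whose text contains the word, ignoring case."""
    needle = word.lower()
    return lambda text: needle in text.lower()


def refine(parent: Query, child: Query) -> Query:
    """A substream of a substream: parent query AND child query."""
    return lambda text: parent(text) and child(text)


# Mirrors the nested menus of FIG. 2 under the assumptions above.
lifestreams_and_david = refine(keyword("lifestreams"), keyword("david"))
scenarios = refine(lifestreams_and_david, keyword("scenarios"))
scott = refine(scenarios, keyword("scott"))
```

Under these assumptions a document reaches the "scott" substream only if it satisfies every query along the chain from the main stream.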

FIG. 3 shows the summarize menu 190 which lists the 
possible summary types. Choosing any of these menu 
options creates a substream summary and a new document 
containing the summary is placed on the stream.

The Personal Agents menu 200 lists a number of available 
software agent types. Personal software agents can be added 
to the user interface in order to automate common tasks. 

The embodiment illustrated in FIG. 4 always displays the
time in the upper right hand corner of the viewport interface.
This time display also acts as a "time" pull-down menu 190
that allows the user to set the viewport time to the future or
past via a calendar-based dialog box as illustrated in FIG. 5.
Setting the viewport time causes the cursor to point to that
timepoint position in the stream such that all documents
forward of that timepoint, that is, towards the head of the
stream, have a future timestamp and all documents behind
that timepoint, that is, towards the tail, have a past
timestamp. As time progresses, this cursor moves forward
towards the head of the stream. When the cursor slips in
front of the present timepoint "future" documents are added
to the visible part of the stream in the viewpoint, just as
new mail arrives.

The effect of setting the time to the future or past is to 
reset the time-cursor temporarily to a fixed position
designated by the user. Normally the user interface displays all
documents from the past up to the time-cursor. Setting the
time-cursor to the future allows the user to see documents in
the future part of the stream. Creating a document in the
future results in a document with a future timestamp. Once
the user is finished time-tripping, the user can reset to the 
present time by selecting the "Set time to present" menu 
option in the time menu. 
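
A minimal sketch of the time-cursor behavior described above follows, offered only as an illustration. The class `Viewport` and its methods `set_time`, `set_time_to_present` and `visible` are assumed names, and documents are reduced to their timestamps for brevity.

```python
from bisect import bisect_right
from datetime import datetime
from typing import List, Optional


class Viewport:
    """Shows all documents from the past up to the time-cursor."""

    def __init__(self, timestamps: List[datetime]) -> None:
        self.timestamps = sorted(timestamps)
        # None means the cursor follows the present time.
        self.fixed_cursor: Optional[datetime] = None

    def set_time(self, when: datetime) -> None:
        # Temporarily pin the cursor to a past or future timepoint.
        self.fixed_cursor = when

    def set_time_to_present(self) -> None:
        # Corresponds to the "Set time to present" menu option.
        self.fixed_cursor = None

    def cursor(self, now: datetime) -> datetime:
        return now if self.fixed_cursor is None else self.fixed_cursor

    def visible(self, now: datetime) -> List[datetime]:
        # Everything at or behind the cursor is displayed.
        end = bisect_right(self.timestamps, self.cursor(now))
        return self.timestamps[:end]
```

Because `visible` takes the current time as an argument, a future document in this sketch becomes visible on its own once the present reaches its timestamp, without any explicit move.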

In one embodiment of the present invention "browse
cards" 100 are employed so that when the user touches a
document in the stream-display with the cursor, a browse
card appears. The purpose of the browse card is to help the
user identify a document by providing the user some idea of
the document's contents in a small window. The content of
browse cards is an abbreviated version of a document which
has been compressed into a micro-document like an index
card. In one embodiment, the browse card creation operation
does header stripping so that the browse card displays the
first non-trivial words in a document. In another
embodiment, complex analysis is performed on the document
contents so that 'most important' words, pictures
and/or sounds are presented.
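
Header stripping for a browse card might look like the following sketch, which is an illustration under stated assumptions and not the method of the embodiment. It assumes mail-style `Name: value` header lines at the top of a document; the function name `browse_card`, the `HEADER` pattern and the `max_words` parameter are hypothetical.

```python
import re

# Assumed header shape: "Name: value" at the start of a line.
HEADER = re.compile(r"^[A-Za-z][A-Za-z0-9-]*:\s")


def browse_card(document: str, max_words: int = 12) -> str:
    """Return the first non-trivial words of a document."""
    words = []
    in_headers = True
    for line in document.splitlines():
        # Skip leading header lines and blank lines only.
        if in_headers and (not line.strip() or HEADER.match(line)):
            continue
        in_headers = False
        words.extend(line.split())
        if len(words) >= max_words:
            break
    return " ".join(words[:max_words])
```

Stripping stops at the first body line in this sketch, so a later line that merely looks like a header is kept as content.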

Another embodiment of the present invention provides
"calling cards" which represent or point to a stream or
substream. Every stream has a calling card and the only way
to reference a stream is via this calling card. In this
embodiment the find operation performs as follows: (1) the user
provides a search query; (2) an appropriate substream is
generated; (3) the substream's calling card is generated; and
(4) the new calling card is deposited as a new document at
the head of the main stream. Every duplicated calling card
bears on the face text, an icon or both. In the case of the find
operation, the new calling card is marked with the argument
supplied by the user for the search query, for example "from:
Schwartz and Lifestreams" or "last letter from Piffel". As a
default in this embodiment, the interface will automatically
display the new substream.

Another embodiment of the present invention allows 
documents to be grouped explicitly into a substream. With 
this feature the user marks, that is, selects all documents to 
be included in the substream and groups the selected docu- 
ments into a substream by creating a new calling card. The 
new calling card comes equipped with a system-created icon 
which is marked on all documents that are part of the new 
stream and the user may add any other notation to the face 
of the new calling card, for example, "these should be 
merged together to produce the Zeppelin report." 

In the embodiments with calling cards the "transfer" 
operation takes two arguments: a document and a calling 
card so that the document is copied onto the stream desig- 
nated by the calling card. The document may itself be a 
calling card and depending on instructions from the user, 
either the calling card itself or the stream designated by the
calling card is copied onto the new stream.

Each main stream in this embodiment has a calling card 
which allows 'inter-main stream' communication. To com- 
municate a user includes on the face of the calling card for 
the user's main stream whatever information the user is 
willing to make public. Other users wanting to send that
user electronic mail will need a copy of that user's calling card,
which might be, for example, "Rock Q. Public, Blimp 
Mechanic, Passaic N.J." 

To give only limited access to the user's stream, a user
provides a copy of the calling card customized to provide the 
desired access. Minimal access gives other users append- 
only privileges, that is, user B can send user A mail, but 
cannot view anything on user A's stream. Access restrictions 
beyond "minimal" are stated in terms of substreams. In other 
words, a calling card gives access to all documents con- 
tained in the specified substream unless that document is 
also contained in one of the specified excluded substreams.
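
The access rule stated above can be sketched as follows, purely as an illustration. The class `CallingCard` and its fields `can_append`, `included` and `excluded` are assumed names; substreams are modelled as predicates over a document represented by a plain dictionary, which is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Doc = Dict[str, object]
SubstreamQuery = Callable[[Doc], bool]


@dataclass
class CallingCard:
    owner: str
    # Minimal access: the holder may send documents to the owner.
    can_append: bool = True
    # Substream whose documents the holder may view (None: view nothing).
    included: Optional[SubstreamQuery] = None
    # Substreams whose documents stay hidden even if included.
    excluded: List[SubstreamQuery] = field(default_factory=list)

    def can_view(self, doc: Doc) -> bool:
        if self.included is None or not self.included(doc):
            return False
        # Visible unless the document is also on an excluded substream.
        return not any(query(doc) for query in self.excluded)
```

A card created with no `included` substream corresponds in this sketch to append-only privileges: user B can send user A mail but cannot view anything on user A's stream.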

The present invention allows a stream document to con- 
tain another stream, that is a ^stream envelope'. A stream 
envelope is equivalent to a 'value' calling card versus the 
'reference' calling cards discussed above. In other words, 
rather than point to another stream with a calling card, the 
stream envelope contains a copy all the documents from the 
other stream. For example, user A transfers to user B a
substream consisting of all Zeppelin-manual correspondence,
which contains many documents. However, a single new
document appears on user B's stream: a stream envelope.
The stream envelope may be opened, yielding the many
documents of the forwarded stream.

According to the present invention, a text-editor designed
specifically for stream A can treat a document as a stream of
bytes so that software agents designed to 'ride' streams
could ride documents as well. Also, the stream find operation
can scan the streamed document, and synchronization
based on stream properties can be applied to the streamed
document.

Streams can be copied and combined into new streams,
that is, streams can be merged. For example, if a user
acquires stream segments from ten electronic newspapers
and magazines all covering the same one-month period, the
segments can be merged in a sorted order into a single
combined stream.

Another feature of the present invention is a card gallery
which consists of some reasonable number of
micro-documents, for example twelve, arranged in such a
way that each is always fully visible on the viewport, for
example, in two columns of 6 each at the right of the display.
Each micro-document is a calling card or a micro-browse-card
(MBC) to a "regular" document on the stream or in a
squish. The micro browse cards in the card gallery represent
documents a user has been working on. Whenever a user
opens a document or creates a substream or squish, the
corresponding micro browse card is added to the card
gallery. A user can re-open the document or squish, or have
the viewport display the substream, by clicking on, or
otherwise selecting, the corresponding micro browse card.

The micro browse card is administered as a least-recently-used
cache, that is, new cards are dealt on top of the
least-recently-used existing card. However, users can override
this mechanism and place or lock a card in the gallery.
For example, a live squish can act as weather-station,
appointment calendar, stock ticker or other current-status
reporter if the user locks the micro browse card for the
squish in the gallery.

Because a user's card gallery includes by default the
calling cards of streams the user has recently opened, the
card gallery acts, to the extent streams are used as Web sites,
as a World Wide Web "hot list."

In one embodiment of the present invention at least some
part of the stream is in the form of a receding stack of upright
rectangles, framed in such a way that only the top line of
each document is visible. A foreshortened viewing angle
yields a view that is approximately a right triangle, the
bottom edge aligned with the bottom of the display and the
left edge aligned with the display's left border.

In another embodiment of the inventive operating system
a 'slide rule' bar display is provided which is labelled with
the endpoints of the stream, that is, the dates of the
farthest-past and farthest-future documents. The document density
can be illustrated, for example, by the amount of color
saturation of the bar at any point. This type of display aids
the user because some days, weeks, months or other time
periods have more associated documents, some have fewer.
The slide rule has a magnifier that the user can slide via a
mouse, for example up and down the bar. The magnifier
obscures the portion of the slide rule that lies beneath, but the
obscured segment is replaced by an enlarged view of the
small part of the stream starting at the point touched by the
upper edge of the magnifier, or some similar protocol for
defining the starting point of the magnified segment. By
sliding the magnifier, you change the part of the stream
currently displayed in the main perspective view.

In another embodiment of the present invention at least
some part of the stream is in the form of a conventional
calendar month display. With this display, the stream-segment
associated with day n appears as a list of document
headers in n's calendar box.

To contemplate the future instead of the past, according to
one embodiment of the present invention, a user can reverse
the stream so that the head of the stream now appears in
front, with the nearest-future document immediately behind
that document, next-nearest behind that and so forth. The
user then looks from the present into the future so that
furthest-away document in the display equals furthest-away
in future time.

All documents older than some date d may be moved by
the server from immediately-accessible storage to cheaper,
long-term storage. When a document is archived in this way,
however, the browse card of that document may remain
available in immediately-accessible storage, so that the
archived document appears in the regular way in the viewport.
When a user opens an archived document, the user may
incur some delay as the server locates and reloads the body
of the document.

Automatic archiving is a feature of the standalone
embodiment and user-managed web site embodiment. In
this embodiment, the streams operating system monitors
remaining disk space and when available space is low, the
operating system asks the user to pop in some diskettes or
other storage media. Similarly when an archived document
needs to be reloaded, the operating system tells the user
which diskettes or other storage media to insert.

In another embodiment of the present invention a chat
feature is provided. If two users want to chat online in UNIX
TALK style, one user creates a new stream and each user focusses
the viewport on that new stream. To make a comment, a user
pops a new document on the stream head with the comment
contained as text inside the document. The stream
synchronization properties allow many users to manipulate a stream
concurrently, and allow a user to block at the end waiting for
the arrival of a new document, which would mean in this case
awaiting the next comment.

A chat stream by its nature provides: (1) permanent record
and (2) support for multiple parties to a conversation. A chat
stream is in this sense a real-time bulletin board. In this
regard a network bulletin board may be stored in a stream
providing: (1) archived comments that can be searched and
retrieved using the standard streams operations; (2)
synchronization characteristics like a chat stream; and (3) a bulletin
board that can be located via the find operation.

In another embodiment of the present invention any
software agent with the necessary access can ride your
stream. Therefore streams can be the basis of groupware
systems implemented, for example, as agents. For
example, when user A wants to schedule a meeting, a
software agent departs from user A's stream to visit the
streams of each of the other intended participants. Each
user's stream lists the current appointments in the stream's
future portion, and each also includes a document
giving the user's general availability in pre-arranged terms
that the meeting-maker software agent can understand.
When the software agent finds an appropriate meeting time,
the software agent posts a document to each stream's future
and creates a new stream for the meeting itself. The software
agent forwards the calling card of the meeting stream to each
participant. This new stream serves as a chat stream on
which the participants can discuss the meeting beforehand,
accumulate any material developed during the meeting itself
and persists when the meeting is over as a record and a
vehicle for post-meeting discussions.
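
The meeting-maker agent described above can be sketched in Python. This is only an illustration under stated assumptions: the text says availability is given "in pre-arranged terms" without defining them, so the sketch assumes each stream exposes a set of free start times. The names `Stream`, `free_slots` and `schedule_meeting` are hypothetical, and the posted text merely stands in for an appointment document plus a calling card.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Stream:
    """Hypothetical stream: time-stamped documents plus an availability set."""
    owner: str
    free_slots: set                      # assumed format: free start times
    documents: list = field(default_factory=list)   # (timestamp, text) pairs

    def post(self, when, text):
        """Place a document at time `when`, keeping chronological order."""
        self.documents.append((when, text))
        self.documents.sort(key=lambda doc: doc[0])


def schedule_meeting(organizer, participants, topic):
    """Visit every stream, pick the earliest slot free for all, and post it."""
    streams = [organizer, *participants]
    common = set.intersection(*(s.free_slots for s in streams))
    if not common:
        return None                      # no slot suits everyone
    when = min(common)
    meeting = Stream(owner=topic, free_slots=set())  # new stream for the meeting
    for s in streams:
        # One document in each participant's future, naming the meeting stream.
        s.post(when, f"Meeting '{topic}' (calling card: {meeting.owner})")
    return when, meeting
```

Choosing the earliest common slot is one possible policy; the text does not say how the agent decides among several suitable times.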

In the following embodiments a stream naturally provides 
a structure for storing technical an electronic versions of a 
newspaper or magazine.

In addition, a mail-order firm might store its catalog in a
stream with each document describing one item. A top page
can embed calling cards as hyperlinks so that the streams
pointed-to are updated automatically by the persistent
substream mechanism. Each user can also reformat the catalog
to taste by creating a substream containing descriptions of
whatever sort of object interests the user.



In another embodiment of the present invention a phone
conversation is stored as a time-ordered sequence of spoken
sounds or as electronic representations. When two users
want to have a phone conversation, the users can use
software, such as a software agent, that creates a new stream
and hands each user the calling card. Each user's 'phone
agent' tosses digitized representations of speech frames onto
the stream and grabs each new frame that appears, turning
each speech frame into sound. In this scheme, phone and
voicemail are integrated in the all-purpose stream context
and can be manipulated using the standard stream operations.

Additionally, in another embodiment of the present invention
a television source can be stored as a time-ordered
sequence of sound-and-image frames. Such television
information is an archive as well as a realtime source and can be
searched and substreamed. A television set is merely a
viewport. Also, scheduling information can be stored in the
television stream's future and tuning into a television station
only requires double-clicking on the appropriate calling
card. Similar embodiments can provide for radio stations,
music sources, etc.

A stream according to the present invention can be
controlled by a voice-interface as well as a computer and
thereby be accessed via a conventional phone. The voice
interface would allow: (1) the stream to be searched and
manipulated; (2) new objects to be installed; (3) objects to
be transferred; and (4) other capability.

The following embodiment discusses how the present
invention is used for electronic mail. To send a message, the
user creates a new document, for example by clicking on the
New button, and composes the message using a favorite
editor. After composition, the message document is sent with
a push of the Xfer button. Similarly, existing documents are
easily forwarded to other users, or can be cloned and replied
to. While all mail messages, both incoming and outgoing,
are intermixed with other documents in the stream, the user
can create a mailbox by substreaming on documents created
by other users. A user can also create substreams that contain
a subset of the mailbox substream, such as "all mail from
Bob", or "all mail I haven't responded to".

With the present invention, a reminder can be generated
as future electronic mail, that is, mail that will
arrive in the future. If the user dials to the future before
writing a message document, when the message document is
transferred the message document will not appear on the
recipient's stream until either that time arrives or the recipient
happens to dial the recipient's viewport to the set creation
date. In the present, the document will be in the stream data
structure but the viewport will not show the document. By
appearing just-in-time and not requiring the user to switch to
yet another application, these reminders are more effective
than those included in a separate calendar or scheduling
utility program.

One embodiment of the present invention supports an
electronic business cards document type as well as a phone
call record document for noting the date and time of phone
contacts. In addition, the task of creating a phone call record
is automated through a personal agent. The personal software
agent is automatically attached to the personal agent
menu so that anytime a user wants to make a call the user [...]

The user types in the name of the callee and the agent
searches the current stream for a business card with that
name. If the name is found, the software agent creates and
fills in the appropriate entries of the phone call record as
seen in FIG. 7. This functionality is similar to the use of the
personal assistant on the Newton personal digital assistant.
The user can later use the streams summarize operation to
summarize the phone calls made. This results in a report as
shown in FIG. 6h.

In another embodiment of the present invention this
functionality is extended to include the functions of a time
manager. Time managers generally track the billable hours
a professional spends on one or more projects. In streams
this is easily accomplished by creating a timecard that marks
the starting and ending time of each task. These timecards
are just thrown onto the stream as used. Then, before each
billing period, the stream is summarized by the timecards,
resulting in a detailed billing statement for each contract.

Another embodiment of the present invention organizes a
user's personal finance. A large number of users already track
their checking accounts, savings, investments, and budgets
with applications such as QUICKEN. The types of records
and documents used in these applications, such as electronic
checks, deposits, securities transactions, and reports, are
conveniently stored and generated by streams.

For example, a stock quote service may forward the daily
closing prices of a given portfolio to a user's stream at the
end of every business day. These documents are as shown in
FIG. 8a. Such documents can list each stock and mutual
fund along with its closing price, giving the user a method
of calculating the value of the user's assets on a specific day.
But if the user wants a higher-level view of the portfolio over
time, the summarize operation can be used. For example, the
user first selects a substream containing the stock quote
documents and selects the "summarize by portfolio" menu
item. This operation compresses the data into a single chart
of historical data which summarizes the portfolio documents
in the substream. This result is illustrated in FIG. 8b.

Another embodiment of the present invention provides a
stream-based checking account. Each check written creates
a record on the user's stream. Some of these checks are
electronic checks sent to companies with an online presence;
other checks are transcribed from written checks. The user,
in this embodiment, employs a personal software agent to
help balance his checkbook. At year's end the user runs a tax
summary which squishes the financial information in the
user's stream onto income tax forms which can be sent
electronically to the Internal Revenue Service.

Streams can also be used for budgeting, tracking
expenditures, etc. Streams contain everything a user deals
with in the user's electronic life in a convenient and searchable
location.

As discussed previously, every user can send out custom
calling cards that grant access to a user's stream. Thus, the
particular user's stream can function as a personal World
Wide Web site such that the web site is merely a subset of
the user's main stream or a substream. For the convenience
of external users, a user can generate a "guide to this stream"
document that functions as a top page. In the context of the
present invention, a hyperlink or a bookmark is just a
calling card. By double clicking, or some comparable
mechanism, on a calling card the viewport displays the
specified stream. Embedding a link from one document to
another document means to embed calling cards.

The present invention's personal web site provides more 
features than a conventional World Wide Web site because:
(1) the web site and personal information site are unified and 
maintained simultaneously with the same toolset; (2) visitors 
to the site use the same interface as for the visitor's own 
stream, that is, the visitor can browse, create substreams and 
squish; (3) visitors can be given customized access levels so 
that friendly visitors get to see more; and (4) the personal 
web site can filter incoming documents. 

Streams of the present invention are designed to work 
with conventional World Wide Web browsers, thus opening 
a document of type web bookmark causes the appropriate 
browser to fire up as an application the way a text editor fires
up when the user opens a text document. However, streams 
also provide an indigenous web-browsing model. Key fea- 
tures such as calling cards and find provide this functionality 
so that the viewport itself functions as the browser. 

Streams may also be quite useful for managing informa- 
tion outside of the system. For example, keeping track of 
web bookmarks is difficult and bookmarks are inconvenient
to pass to other users. Conventional systems accomplish 
those transactions by copying a Web address from a web 
browser to an electronic mail message which the recipient 
then copies from electronic mail back to recipient's browser 
and adds this web address as a bookmark. Streams solve 
both of these problems. 

In one embodiment, an agent watches each user's bookmark
file for each time a new bookmark is added and then
adds the same bookmark to a stream as a new Web address
document. The effect of opening a Web address document in
a stream is that the web browser comes to the foreground
and attempts to connect to the Web address. In this way
streams create a bookmark substream while at the same time
making the data in the bookmarks readily available to any
other search a user may make.

Passing Web addresses around is trivial: the user merely
copies the Web address document to another user's stream (a 
one-step process) and the Web address is automatically 
included in the recipient bookmark substream. 

A stream is a data structure that can be examined and, to
the extent possible, manipulated by many processes
simultaneously. Also a process may block at the end of a stream, that
is, suspend the stream operation, until awakened when a new
document appears on the stream head. Streams need to
support the block-at-the-end operation so that a software
agent or, what amounts to the same thing, a substream
or a live squish document can examine each new document
arriving at the stream.

A stream must support simultaneous access because: (1) 
a user creates many software agents which may need to 
examine the stream concurrently; and (2) a user may have 
granted other users limited access to the user's stream, and
the user will want access to this stream even while the other 
users access the stream. 
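
A minimal Python sketch of the block-at-the-end operation under concurrent access follows. It assumes an in-memory stream shared by threads and uses the standard library's `threading.Condition`. The class and method names are hypothetical and do not come from the specification.

```python
import threading


class Stream:
    """Hypothetical in-memory stream shared by several threads."""

    def __init__(self):
        self._docs = []
        self._changed = threading.Condition()

    def append(self, doc):
        """Add a document at the stream head and wake any blocked readers."""
        with self._changed:
            self._docs.append(doc)
            self._changed.notify_all()

    def read_blocking(self, position, timeout=None):
        """Return the document at `position`, blocking at the end until it arrives.

        Returns None if `timeout` seconds pass without a new document.
        """
        with self._changed:
            arrived = self._changed.wait_for(
                lambda: len(self._docs) > position, timeout=timeout
            )
            return self._docs[position] if arrived else None
```

An agent that keeps its own position and calls `read_blocking` in a loop sees every new document in arrival order, while other readers and writers continue to use the same stream.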

One embodiment of the present invention is configured
such that each server may support three to four simultaneous
users with stream sizes on the order of 100,000 documents
(perhaps a year or two of documents for the average user).
In another embodiment, the operating system is configured
such that lifestreams may have millions of documents or
more. The substreaming aspect of one embodiment of the
present invention is efficiently implemented using an inverse
index of the document collection maintained by the server.
No real performance problems with respect to retrieval have
occurred. Given the very large indices that are being used on
the Internet the retrieval scheme is expected to scale to large
document collections.
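
As an illustration of index-based substreaming, the Python sketch below builds a word-to-documents index (commonly called an inverted index) and answers a substream query by intersecting posting sets. It assumes whitespace tokenization and document ids assigned in chronological order, so sorting by id preserves stream order. The names are hypothetical, and the server's actual index format is not described in the text.

```python
from collections import defaultdict


class InvertedIndex:
    """Hypothetical index mapping each word to the documents that contain it."""

    def __init__(self):
        self._postings = defaultdict(set)   # word -> set of document ids

    def add(self, doc_id, text):
        """Index a document under every word it contains."""
        for word in text.lower().split():
            self._postings[word].add(doc_id)

    def substream(self, query):
        """Return sorted ids of documents containing every word in `query`."""
        words = query.lower().split()
        if not words:
            return []
        hits = set.intersection(
            *(self._postings.get(word, set()) for word in words)
        )
        return sorted(hits)
```

A query then costs roughly the size of the posting sets involved rather than the size of the whole stream, which is the usual reason such an index scales to large collections.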

Since a user is unlikely to look at 10,000 documents at once
and discern any usable information, the present invention
does not provide the user with an entire document collection
at once. Instead "cursors" are used to allow the user to view
segments of the document collection and to load in more
segments as needed.
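
The cursor idea can be sketched in a few lines of Python. The segment size and the `Cursor` and `next_segment` names are assumptions made for illustration only.

```python
class Cursor:
    """Hypothetical cursor that hands out a stream one segment at a time."""

    def __init__(self, documents, segment_size=50):
        self._documents = documents
        self._segment_size = segment_size
        self._position = 0

    def next_segment(self):
        """Return the next segment, or an empty list once the stream is exhausted."""
        start = self._position
        segment = self._documents[start:start + self._segment_size]
        self._position = start + len(segment)
        return segment
```

A viewport would call `next_segment()` only when the user scrolls past what is already loaded, so a collection of 100,000 documents never has to be transferred at once.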

One embodiment of the invention provides a single- 
threaded server which allows a single point of access. Other 
embodiments of the present invention utilize a multi-server 
and multi-threaded approach which provides a more scalable 
architecture. 

Regarding the term "agent" used in this application, it is
noted that this term refers to one of three kinds of embedded
computations: personal agents, document agents, and stream
agents. Personal agents are typically attached to the user
interface and can automate tasks or can learn from the user's
interactions. Document agents live on documents
and are spawned, for example, the
first time that a document is accessed. Stream agents are
attached to streams and execute whenever the stream
changes in some way, for example, a new document appears
on the stream.
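
The difference between stream agents and document agents can be illustrated with plain callbacks. The Python sketch below is an assumption-laden illustration, not the specification's design: agents are modelled as ordinary callables, and all names are hypothetical.

```python
class Document:
    """Hypothetical document whose agent runs the first time it is opened."""

    def __init__(self, text, agent=None):
        self.text = text
        self._agent = agent
        self._opened = False

    def open(self):
        if not self._opened and self._agent is not None:
            self._agent(self)            # document agent: spawned on first access
        self._opened = True
        return self.text


class Stream:
    """Hypothetical stream that runs its agents whenever a document arrives."""

    def __init__(self):
        self.documents = []
        self._agents = []

    def attach(self, agent):
        """Attach a stream agent: a callable invoked with each new document."""
        self._agents.append(agent)

    def add(self, document):
        self.documents.append(document)
        for agent in self._agents:       # stream agents: run on every change
            agent(document)
```

Personal agents are left out of the sketch because they attach to the user interface rather than to any data structure.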

Further, regarding the term "document", it is noted that
this term includes traditional text based files, electronic mail
files, binary files, audio data, video data, and multimedia
data.

Additionally, this document stream operating system can
be implemented as an independent operating system with all
required subsystems such as: storage subsystems in software
and/or hardware for writing documents to disc drives,
tape drives and the like; interrupt handling subsystems; and
input/output subsystems. However, the present invention
also encompasses implementations which utilize subsystems
from other operating systems such as the Disk Operating
System (DOS), WINDOWS, and OPERATING SYSTEM 7.
In such implementations, the graphic user interface (GUI) of
the other operating system can be replaced by the present
invention viewports. Alternatively, the present invention can
operate as a document stream utility for the other operating
system.

It must be noted that although the present invention is
described by reference to particular embodiments thereof,
many changes and modifications of the invention may
become apparent to those skilled in the art without departing
from the spirit and scope of the invention, which is only
limited by the appended claims. For example, documents may have
associated attributes used to locate the document during a
search, for example, a special code word selected by the
user.

What is claimed is:

1. A computer system which organizes each data unit 
received by or generated by the computer system, compris- 
ing: 

means for generating a main stream of data units and at 
least one substream, the main stream for receiving each 
data unit received by or generated by the computer 
system, and each substream for containing data units 
only from the main stream; 

means for receiving data units from other computer 
systems; 

means for generating data units by the computer system; 
means for selecting a timestamp to identify each data unit; 

means for associating each data unit with at least one 
chronological indicator having the respective times- 
tamp; 

means for including each data unit according to the 
timestamp in the respective chronological indicator in 
the main stream; and 

means for maintaining the main stream and the sub-
streams as persistent streams.

2. The computer system of claim 1, wherein each times- 
tamp is selected from the group consisting of: past, present, 
and future times. 

3. The computer system of claim 1, wherein each data unit 
includes textual data, video data, audio data and/or multi- 
media data. 

4. The computer system of claim 1, wherein the means for 
receiving further comprises means for receiving data units 
from the World Wide Web. 

5. The computer system of claim 1, wherein said means 
for receiving further comprises means for receiving data 
units from a client computer. 

6. The computer system according to claim 1, further 
comprising: 

means for displaying alternative versions of the content of 
the data units. 

7. A computer system according to claim 1 further com- 
prising: 

means for summarizing the contents of data units in one 
of the streams to generate one or more overview data 
units and for including the overview data unit in one of 
the streams.

8. A computer system according to claim 7, wherein the 
means for summarizing further comprises means for con- 
tinuously updating the overview data units to include 
changes in the contents of data units in the stream being 
summarized. 

9. A computer system according to claim 1 further com- 
prising: 

means for archiving a data unit associated with a times- 
tamp older than a specified time point while retaining 
the respective chronological indicator and/or a data unit 
having a respective alternative version of the content of 
the archived data unit. 

10. The computer system of claim 1, wherein the com-
puter program further comprises:

means for operating on any of the streams using a set of
operations selected by a user.

11. The computer system of claim 1 further comprising:

means to generate substreams from existing substreams.

12. A computer system as in claim 1, further comprising:

means for generating a data unit comprising an alternative
version of the content of another data unit; and

means for associating the alternative version data unit 
with the chronological indicator of the another data 
unit. 

13. A method which organizes each data unit received by
or generated by a computer system, comprising the steps of:

generating a main stream of data units and at least one
substream, the main stream for receiving each data unit
received by or generated by the computer system, and
each substream for containing data units only from the
main stream;

receiving data units from other computer systems; 

generating data units in the computer system; 

selecting a timestamp to identify each data unit; 

associating each data unit with at least one chronological
indicator having the respective timestamp;

including each data unit according to the timestamp in the 
respective chronological indicator in at least the main 
stream; and 

maintaining at least the main stream and the substreams as
persistent streams.

14. The method of claim 13, wherein each timestamp is 
selected from the group consisting of: past, present, and 
future times. 

15. The method of claim 13, further comprising the step
of displaying the streams on a display device as visual
streams.

16. The method of claim 15, wherein the step of display- 
ing the streams further comprises the steps of: 

a) receiving from a user one or more indications of one or
more selected segments of the streams corresponding to
one or more selected intervals of time, and

b) displaying the selected segments. 

17. The method of claim 13, wherein each data unit 
includes textual data, video data, audio data and/or multi- 
media data. 

18. The method of claim 13, further comprising the step
of:

providing access to a first stream from a second stream by
generating a data unit indicating the first stream.

19. The method of claim 13, further comprising the steps
of:

selecting access privileges to provide to a first stream
from a second stream; and

providing access to the first stream from the second
stream according to the access privileges.

20. The method of claim 13, further comprising the step 
of: 

displaying data from one of the data units in abbreviated
form.

21. The method of claim 13, further comprising the step 
of: 

summarizing the contents of data units in a stream to
generate one or more overview data units and including
the overview data unit in one of the streams.

22. The method of claim 13, further comprising the step 
of: 

archiving data units having timestamps older than a
specified time point.

23. A computer system for organizing each data unit
received by or generated by the computer system, compris-
ing:

means for generating a main stream of data units and at
least one substream, the main stream for receiving each
data unit received by or generated by the computer
system, and each substream for containing data units
only from the main stream;

means for associating each data unit with at least one
chronological indicator having a respective timestamp
which identifies the data unit;

means for including each data unit according to the
timestamp in a respective chronological indicator in
the main stream;

means for maintaining the main stream and substreams
as persistent streams;

means for generating a data unit having indicia to allow 
access to a first stream from a second stream; 

means for including the data unit having the indicia in the 
second stream; and 

means for providing access to the first stream from the 
second stream in accordance with the indicia. 

24. A computer system according to claim 23 further 
comprising: 

means for providing limited access to the first stream from 
the second stream by generating a data unit indicating 
access privileges to the first stream. 

25. A computer system for organizing each data unit
received by or generated by the computer system, compris-
ing:

means for generating a main stream of data units and at
least one substream, the main stream for receiving each
data unit received by or generated by the computer
system, and each substream for containing data units
only from the main stream;

means for associating each data unit with at least one
chronological indicator having a respective timestamp
which identifies the data unit;

means for including each data unit according to the
timestamp in a respective chronological indicator in
the main stream;

means for maintaining the main stream and the
substreams as persistent streams;

means for representing one or more data units of a
selected stream on a display device as document
representations, each document representation includ-
ing the timestamp of the respective data unit and the
order of appearance of each data representation on the
display device determined by the timestamp of the
respective data unit;

means for selecting which data units are represented on 
the display device by selecting one of the document 
representations and displaying document representa- 
tions corresponding to data units having timestamps 
within a range of a timepoint; and 

means for selecting one or more of the document repre- 
sentations with a pointing device so that the data units 
represented by the selected document representations 
are further displayed with a second document repre- 
sentation comprising an alternative version of the con- 
tent of the respective data unit. 

26. A computer system as in claim 25, wherein the 
document representations form a visual stream having a 
three-dimensional effect. 

27. A computer system as in claim 26, wherein the 
three-dimensional effect further comprises a perspective 
view. 

28. A computer system as in claim 25, wherein each 
document representation comprises a polygon and the poly- 
gons overlap to form a visual stream of polygons. 

29. A computer system as in claim 25, wherein the 
alternative version is an abbreviated version. 

30. A computer system as in claim 25, wherein the 
alternative version is a caption version. 

31. A computer system as in claim 25, wherein the 
alternative version is an expanded version. 

32. A computer system as in claim 25, further comprising: 
means for selecting one or more alternative versions of 

the content of a respective data unit to display another 
alternative version of the content of the data unit. 

33. A computer system as in claim 25, further comprising: 
means for updating the display device to provide a 

document representation for data units associated with 
chronological indicators having timestamps which 
become the present time.

ABSTRACT



This invention relates to customized electronic identification
of desirable objects, such as news articles, in an electronic
media environment, and in particular to a system that
automatically constructs both a "target profile" for each
target object in the electronic media based, for example, on
the frequency with which each word appears in an article
relative to its overall frequency of use in all articles, as well
as a "target profile interest summary" for each user, which
target profile interest summary describes the user's interest
level in various types of target objects. The system then
evaluates the target profiles against the users' target profile
interest summaries to generate a user-customized rank
ordered listing of target objects most likely to be of interest
to each user so that the user can select from among these
potentially relevant target objects, which were automatically
selected by this system from the plethora of target objects
that are profiled on the electronic media. Users' target profile
interest summaries can be used to efficiently organize the
distribution of information in a large scale system consisting
of many users interconnected by means of a communication
network. Additionally, a cryptographically-based pseudonym
proxy server is provided to ensure the privacy of a
user's target profile interest summary, by giving the user
control over the ability of third parties to access this
summary and to identify or contact the user.

20 Claims, 13 Drawing Sheets 
SYSTEM AND METHOD FOR PROVIDING
CUSTOMIZED ELECTRONIC NEWSPAPERS 
AND TARGET ADVERTISEMENTS 

CROSS-REFERENCE TO RELATED 
APPLICATIONS 

This patent application was originally filed as provisional 
application Serial No. 60/032,462, filed on Dec. 9, 1996 and 
is a continuation-in-part of U.S. patent application Ser. No.
08/346,425, filed Nov. 28, 1994, now U.S. Pat. No. 5,758,257,
and titled "SYSTEM AND METHOD FOR SCHEDULING
BROADCAST OF AND ACCESS TO VIDEO
PROGRAMS AND OTHER DATA USING CUSTOMER 
PROFILES", which application is assigned to the same 
assignee as the present application. 

FIELD OF INVENTION 

This invention relates to customized electronic identifi- 
cation of desirable objects, such as news articles, in an 
electronic media environment, and in particular to a system 
that automatically constructs both a "target profile" for each 
target object in the electronic media based, for example, on 
the frequency with which each word appears in an article 
relative to its overall frequency of use in all articles, as well 
as a "target profile interest summary" for each user, which 
target profile interest summary describes the user's interest 
level in various types of target objects. The system then
evaluates the target profiles against the users' target profile
interest summaries to generate a user-customized rank 
ordered listing of target objects most likely to be of interest 
to each user so that the user can select from among these 
potentially relevant target objects, which were automatically 
selected by this system from the plethora of target objects 
that are profiled on the electronic media. Users' target
profile interest summaries can be used to efficiently organize
the distribution of information in a large scale system
consisting of many users interconnected by means of a
communication network. Additionally, a cryptographically
based proxy server is provided to ensure privacy of a user's
target profile interest summary, by giving the user control 
over the ability of third parties to access this summary and 
to identify or contact the user. 

PROBLEM 

It is a problem in the field of electronic media to enable 
a user to access information of relevance and interest to the 
user without requiring the user to expend an excessive 
amount of time and energy searching for the information. 
Electronic media, such as on-line information sources, pro- 
vide a vast amount of information to users, typically in the 
form of "articles," each of which comprises a publication 
item, or document that relates to a specific topic. The 
difficulty with electronic media is that the amount of infor- 
mation available to the user is overwhelming and the article 
repository systems that are connected on-line are not orga- 
nized in a manner that sufficiently simplifies access to only 
the articles of interest to the user. Presently, a user either
fails to access relevant articles because they are not easily 
identified or expends a significant amount of time and 
energy to conduct an exhaustive search of all articles to 
identify those most likely to be of interest to the user. 
Furthermore, even if the user conducts an exhaustive search, 
present information searching techniques do not necessarily 
accurately extract only the most relevant articles, but also 
present articles of marginal relevance due to the functional 
limitations of the information searching techniques. There is
also no existing system which automatically estimates the
inherent quality of an article or other target object to
distinguish among a number of articles or target objects 
identified as of possible interest to a user. 

Therefore, in the field of information retrieval, there is a
long-standing need for a system which enables users to
navigate through the plethora of information. With commer-
cialization of communication networks, such as the Internet,
the growth of available information has increased. Customi-
zation of the information delivery process to the user's
unique tastes and interests is the ultimate solution to this
problem. However, the techniques which have been pro-
posed to date either only address the user's interests on a
superficial level or provide greater depth and intelligence at
the cost of unwanted demands on the user's time and energy.
While many researchers have agreed that traditional meth-
ods have been lacking in this regard, no one to date has
successfully addressed these problems in a holistic manner
and provided a system that can fully learn and reflect the
user's tastes and interests. This is particularly true in a
practical commercial context, such as on-line services avail-
able on the Internet. There is a need for an information
retrieval system that is largely or entirely passive,
unobtrusive, undemanding of the user, and yet both precise
and comprehensive in its ability to learn and truly represent
the user's tastes and interests. Present information retrieval
systems require the user to specify the desired information
retrieval behavior through cumbersome interfaces.
Users may receive information on a computer network
either by actively retrieving the information or by passively
receiving information that is sent to them. Just as users of
information retrieval systems face the problem of too much
information, so do users who are targeted with electronic
junk mail by individuals and organizations. An ideal system
would protect the user from unsolicited advertising, both by
automatically extracting only the most relevant messages
received by electronic mail, and by preserving the confi-
dentiality of the user's preferences, which should not be
freely available to others on the network.

Researchers in the field of published article information
retrieval have devoted considerable effort to finding efficient
and accurate methods of allowing users to select articles of
interest from a large set of articles. The most widely used
methods of information retrieval are based on keyword
matching: the user specifies a set of keywords which the user
thinks are exclusively found in the desired articles and the
information retrieval computer retrieves all articles which
contain those keywords. Such methods are fast, but are
notoriously unreliable, as users may not think of the right
keywords, or the keywords may be used in unwanted articles
in an irrelevant or unexpected context. As a result, the
information retrieval computers retrieve many articles
which are unwanted by the user. The logical combination of
keywords and the use of wild-card search parameters help
improve the accuracy of keyword searching but do not
completely solve the problem of inaccurate search results.
Starting in the 1960's, an alternate approach to information
retrieval was developed: users were presented with an article
and asked if it contained the information they wanted, or to
quantify how close the information contained in the article
was to what they wanted. Each article was described by a
profile which comprised either a list of the words in the
article or, in more advanced systems, a table of word
frequencies in the article. Since a measure of similarity
between articles is the distance between their profiles, the
measured similarity of article profiles can be used in article
retrieval. For example, a user searching for information on
a subject can write a short description of the desired infor-
mation. The information retrieval computer generates an
article profile for the request and then retrieves articles with
profiles similar to the profile generated for the request. These
requests can then be refined using "relevance feedback",
where the user actively or passively rates the articles
retrieved as to how close the information contained therein
is to what is desired. The information retrieval computer
then uses this relevance feedback information to refine the
request profile and the process is repeated until the user
either finds enough articles or tires of the search.
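
The profile-based retrieval loop described above can be illustrated with a minimal sketch. It assumes that a profile is a table of word frequencies, that similarity is measured by the cosine of the angle between two profiles, and that relevance feedback adds a weighted copy of a liked article's profile to the request profile. These choices, and the function names `profile`, `cosine_similarity`, `rank_articles` and `refine`, are illustrative assumptions and are not taken from the systems discussed here.

```python
import math
from collections import Counter


def profile(text):
    """Build a word-frequency profile: each lowercased word mapped to its count."""
    return Counter(text.lower().split())


def cosine_similarity(p, q):
    """Similarity of two profiles; 1.0 means same direction, 0.0 means no shared words."""
    dot = sum(p[w] * q[w] for w in p if w in q)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


def rank_articles(request_profile, articles):
    """Return article indices ordered from most to least similar to the request."""
    scores = [cosine_similarity(request_profile, profile(a)) for a in articles]
    return sorted(range(len(articles)), key=lambda i: scores[i], reverse=True)


def refine(request_profile, liked_profile, weight=0.5):
    """Relevance feedback (assumed rule): shift the request toward a liked article."""
    refined = Counter(request_profile)  # copy, so the original request is untouched
    for word, count in liked_profile.items():
        refined[word] += weight * count
    return refined


articles = [
    "stock market prices fell",
    "the football team won the match",
    "market analysts expect stock gains",
]
request = profile("stock market news")
order = rank_articles(request, articles)

# The user marks the third article as relevant; the request profile is refined
# and the ranking step can then be repeated with the new profile.
request = refine(request, profile(articles[2]))
```

Repeating the rank and refine steps until the user stops corresponds to the iteration described in the text; the feedback weight of 0.5 is an arbitrary value chosen for the sketch.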

A number of researchers have looked at methods for
selecting articles of most interest to users. An article titled
"Social Information filtering: algorithms for automating
'word of mouth'" was published at the CHI-95 Proceedings
by Patti Maes et al. and describes the Ringo information
retrieval system which recommends musical selections. The
Ringo system requires active feedback from the users;
users must manually specify how much they like or dislike
each musical selection. The Ringo system maintains a
complete list of users' ratings of music selections and makes
recommendations by finding which selections were liked by
multiple people. However, the Ringo system does not take
advantage of any available descriptions of the music, such as
structured descriptions in a data base, or free text, such as
that contained in music reviews. An article titled "Evolving
agents for personalized information filtering", published at
the Proc. 9th IEEE Conf. on AI for Applications by Sheth and
Maes, described the use of agents for information filtering
which use genetic algorithms to learn to categorize Usenet
news articles. In this system, users must define news cat-
egories and the users actively indicate their opinion of the
selected articles. Their system uses a list of keywords to
represent sets of articles and the records of users' interests
are updated using genetic algorithms.

A number of other research groups have looked at the
automatic generation and labeling of clusters of articles for
the purpose of browsing through the articles. A group at
Xerox PARC published a paper titled "Scatter/gather: a
cluster-based approach to browsing large article collections"
at the 15 Ann. Int'l SIGIR '92, ACM 318-329 (Cutting et al.
1992). This group developed a method they call "scatter/
gather" for performing information retrieval searches. In this
method, a collection of articles is "scattered" into a small
number of clusters; the user then chooses one or more of
these clusters based on short summaries of the cluster. The
selected clusters are then "gathered" into a subcollection,
and then the process is repeated. Each iteration of this
process is expected to produce a small, more focused
collection. The cluster "summaries" are generated by pick-
ing those words which appear most frequently in the cluster
and the titles of those articles closest to the center of the
cluster. However, no feedback from users is collected or
stored, so no performance improvement occurs over time.
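
The cluster "summary" step described above can be sketched as follows. The sketch assumes that a cluster is given as a list of (title, text) pairs, that the cluster center is the summed word-frequency profile of its articles, and that closeness to the center is measured by cosine similarity. The function `summarize_cluster` and its parameters are hypothetical names used only for illustration, and this is not the published scatter/gather implementation.

```python
import math
from collections import Counter


def word_counts(text):
    """Word-frequency profile of one article."""
    return Counter(text.lower().split())


def cosine(p, q):
    """Cosine similarity of two word-frequency profiles."""
    dot = sum(p[w] * q[w] for w in p if w in q)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def summarize_cluster(cluster, num_words=3, num_titles=1):
    """Return (most frequent words, titles of the articles nearest the center)."""
    profiles = [word_counts(text) for _, text in cluster]
    # Summing instead of averaging is enough here: cosine similarity ignores scale.
    center = Counter()
    for p in profiles:
        center.update(p)
    # Most frequent words first; ties are broken alphabetically for determinism.
    by_frequency = sorted(center.items(), key=lambda kv: (-kv[1], kv[0]))
    top_words = [word for word, _ in by_frequency[:num_words]]
    nearest = sorted(range(len(cluster)),
                     key=lambda i: cosine(profiles[i], center),
                     reverse=True)
    titles = [cluster[i][0] for i in nearest[:num_titles]]
    return top_words, titles


cluster = [
    ("Rates rise", "bank raises interest rates"),
    ("Bank profits", "bank profits rise as interest rates rise"),
    ("Loans", "bank loans"),
]
words, titles = summarize_cluster(cluster)
```

A practical system would normally drop common function words before counting, which this sketch omits for brevity.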

Apple's Advanced Technology Group has developed an
interface based on the concept of a "pile of articles". This
interface is described in an article titled "A 'pile' metaphor
for supporting casual organization of information in Human
factors in computer systems" published in CHI '92 Conf.
Proc., 627-634 by Mander, R., G. Salomon and Y. Wong,
1992. Another article titled "Content awareness in a file
system interface: implementing the 'pile' metaphor for orga-
nizing information" was published in 16 Ann. Int'l SIGIR
'93, ACM 260-269 by Rose E. D. et al. The Apple interface
uses word frequencies to automatically file articles by pick-
ing the pile most similar to the article being filed. This
system functions to cluster articles into subpiles, determine
key words for indexing by picking the words with the largest
TF/IDF (where TF is term (word) frequency and IDF is the
inverse document frequency) and label piles by using the
determined key words.
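
As an illustration of picking key words by largest TF/IDF, the following minimal sketch scores each word of one document by its term frequency multiplied by log(N / DF), where N is the number of documents and DF is the number of documents containing the word. This logarithmic form of IDF is one common variant and is an assumption here, since the text does not give a formula; the function name `tfidf_keywords` is likewise hypothetical.

```python
import math
from collections import Counter


def tfidf_keywords(documents, doc_index, num_keywords=3):
    """Return the words of documents[doc_index] with the largest TF/IDF scores."""
    tokenized = [doc.lower().split() for doc in documents]
    num_docs = len(tokenized)

    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for words in tokenized:
        doc_freq.update(set(words))

    # Term frequency within the chosen document.
    term_freq = Counter(tokenized[doc_index])

    # A word found in every document scores log(1) = 0 and is never preferred.
    scores = {
        word: count * math.log(num_docs / doc_freq[word])
        for word, count in term_freq.items()
    }
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:num_keywords]]


documents = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the stock market fell",
]
keywords = tfidf_keywords(documents, 2)
```

In this toy collection the word "the" occurs in every document, so it receives a score of zero even though it is the most frequent word, which is the behavior that makes TF/IDF useful for choosing index terms and pile labels.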

Numerous patents address information retrieval methods, 
but none develop records of a user's interest based on 
passive monitoring of which articles the user accesses. None 
of the systems described in these patents present computer
architectures to allow fast retrieval of articles distributed 
across many computers. None of the systems described in 
these patents address issues of using such article retrieval 
and matching methods for purposes of commerce or of 
matching users with common interests or developing records 
of users' interests. U.S. Pat. No. 5,321,833 issued to Chang 
et al. teaches a method in which users choose terms to use 
in an information retrieval query, and specify the relative 
weightings of the different terms. The Chang system then 
calculates multiple levels of weighting criteria. U.S. Pat. No. 
5,301,109 issued to Landauer et al teaches a method for 
retrieving articles in a multiplicity of languages by con- 
structing "latent vectors" (SVD or PCA vectors) which 
represent correlations between the different words. U.S. Pat. 
No. 5,331,554 issued to Graham et al. discloses a method for 
retrieving segments of a manual by comparing a query with 
nodes in a decision tree. U.S. Pat. No. 5,331,556 addresses 
techniques for deriving morphological part-of-speech infor-
mation and thus to make use of the similarities of different
forms of the same word (e.g. "article" and "articles").

Therefore, there presently is no information retrieval and 
delivery system operable in an electronic media environ- 
ment that enables a user to access information of relevance 
and interest to the user without requiring the user to expend 
an excessive amount of time and energy. 

SOLUTION 

The above-described problems are solved and a technical
advance achieved in the field by the system for customized 
electronic identification of desirable objects in an electronic 
media environment, which system enables a user to access 
target objects of relevance and interest to the user without 
requiring the user to expend an excessive amount of time 
and energy. Profiles of the target objects are stored on 
electronic media and are accessible via a data communica- 
tion network. In many applications, the target objects are 
informational in nature, and so may themselves be stored on
electronic media and be accessible via a data communication 
network. 

Relevant definitions of terms for the purpose of this 
description include: (a.) an object available for access by the 
user, which may be either physical or electronic in nature, is 
termed a "target object", (b.) a digitally represented profile 
indicating that target object's attributes is termed a "target
profile", (c.) the user looking for the target object is termed 
a "user", (d.) a profile holding that user's attributes, includ- 
ing age/zip code/etc. is termed a "user profile", (e.) a 
summary of digital profiles of target objects that a user likes 
and/or dislikes, is termed the "target profile interest sum-
mary" of that user, (f.) a profile consisting of a collection of
attributes, such that a user likes target objects whose profiles
are similar to this collection of attributes, is termed a
"search profile" or in some contexts a "query" or "query 
profile," (g.) a specific embodiment of the target profile 
interest summary which comprises a set of search profiles is 
termed the "search profile set" of a user, (h.) a collection of 
target objects with similar profiles, is termed a "cluster," (i.) 
an aggregate profile formed by averaging the attributes of all 
target objects in a cluster, termed a "cluster profile," (j.) a
real number determined by calculating the statistical vari-
ance of the profiles of all target objects in a cluster, is termed
a "cluster variance," (k.) a real number determined by
calculating the maximum distance between the profiles of
any two target objects in a cluster, is termed a "cluster
diameter."

The system for electronic identification of desirable
objects of the present invention automatically constructs
both a target profile for each target object in the electronic
media based, for example, on the frequency with which each
word appears in an article relative to its overall frequency of
use in all articles, as well as a "target profile interest
summary" for each user, which target profile interest sum-
mary describes the user's interest level in various types of
target objects. The system then evaluates the target profiles
against the users' target profile interest summaries to gen-
erate a user-customized rank ordered listing of target objects
most likely to be of interest to each user so that the user can
select from among these potentially relevant target objects,
which were automatically selected by this system from the
plethora of target objects available on the electronic media.

Because people have multiple interests, a target profile
interest summary for a single user must represent multiple
areas of interest, for example, by consisting of a set of
individual search profiles, each of which identifies one of the
user's areas of interest. Each user is presented with those
target objects whose profiles most closely match the user's
interests as described by the user's target profile interest
summary. Users' target profile interest summaries are auto-
matically updated on a continuing basis to reflect each user's
changing interests. In addition, target objects can be grouped
into clusters based on their similarity to each other, for
example, based on similarity of their topics in the case where
the target objects are published articles; and menus auto-
matically generated for each cluster of target objects to allow
users to navigate throughout the clusters and manually
locate target objects of interest. For reasons of confidenti-
ality and privacy, a particular user may not wish to make
public all of the interests recorded in the user's target profile
interest summary, particularly when these interests are deter-
mined by the user's purchasing patterns. The user may
desire that all or part of the target profile interest summary
be kept confidential, such as information relating to the
user's political, religious, financial or purchasing behavior;
indeed, confidentiality with respect to purchasing behavior
is the user's legal right in many states. It is therefore
necessary that data in a user's target profile interest summary
be protected from unwanted disclosure except with the
user's agreement. At the same time, the user's target profile
interest summaries must be accessible to the relevant servers
that perform the matching of target objects to the users, if the
benefit of this matching is desired by both providers and
consumers of the target objects. The disclosed system pro-
vides a solution to the privacy problem by using a proxy
server which acts as an intermediary between the informa-
tion provider and the user. The proxy server dissociates the
user's true identity from the pseudonym by the use of
cryptographic techniques. The proxy server also permits
users to control access to their target profile interest sum-
maries and/or user profiles, including provision of this
information to marketers and advertisers if they so desire,
possibly in exchange for cash or other considerations. Mar-
keters may purchase these profiles in order to target adver-
tisements to particular users, or they may purchase partial
user profiles, which do not include enough information to
identify the individual users in question, in order to carry out
standard kinds of demographic analysis and market research
on the resulting database of partial user profiles.

In the preferred embodiment of the invention, the system
for customized electronic identification of desirable objects
uses a fundamental methodology for accurately and effi-
ciently matching users and target objects by automatically
calculating, using and updating profile information that
describes both the users' interests and the target objects'
characteristics. The target objects may be published articles,
purchasable items, or even other people, and their properties
are stored, and/or represented and/or denoted on the elec-
tronic media as (digital) data. Examples of target objects can
include, but are not limited to: a newspaper story of potential
interest, a movie to watch, an item to buy, e-mail to receive,
or another person to correspond with. In all these cases, the
information delivery process in the preferred embodiment is
based on determining the similarity between a profile for the
target object and the profiles of target objects for which the
user (or a similar user) has provided positive feedback in the
past. The individual data that describe a target object and
constitute the target object's profile are herein termed
"attributes" of the target object. Attributes may include, but
are not limited to, the following: (1) long pieces of text (a
newspaper story, a movie review, a product description or an
advertisement), (2) short pieces of text (name of a movie's
director, name of town from which an advertisement was
placed, name of the language in which an article was
written), (3) numeric measurements (price of a product,
rating given to a movie, reading level of a book), (4)
associations with other types of objects (list of actors in a
movie, list of persons who have read a document). Any of
these attributes, but especially the numeric ones, may cor-
relate with the quality of the target object, such as measures
of its popularity (how often it is accessed) or of user
satisfaction (number of complaints received).

The preferred embodiment of the system for customized
electronic identification of desirable objects operates in an
electronic media environment for accessing these target
objects, which may be news, electronic mail, other pub-
lished documents, or product descriptions. The system in its
broadest construction comprises three conceptual modules,
which may be separate entities distributed across many
implementing systems, or combined into a lesser subset of
physical entities. The specific embodiment of this system
disclosed herein illustrates the use of a first module which
automatically constructs a "target profile" for each target
object in the electronic media based on various descriptive
attributes of the target object. A second module uses interest
feedback from users to construct a "target profile interest
summary" for each user, for example in the form of a "search
profile set" consisting of a plurality of search profiles, each
of which corresponds to a single topic of high interest for the
user. The system further includes a profile processing mod-
ule which estimates each user's interest in various target
objects by reference to the users' target profile interest
summaries, for example by comparing the target profiles of
these target objects against the search profiles in users'
search profile sets, and generates for each user a customized
rank-ordered listing of target objects most likely to be of
interest to that user. Each user's target profile interest
summary is automatically updated on a continuing basis to
reflect the user's changing interests.

Target objects may be of various sorts, and it is sometimes
advantageous to use a single system that delivers and/or
clusters target objects of several distinct sorts at once, in a
unified framework. For example, users who exhibit a strong
interest in certain novels may also show an interest in certain
movies, presumably of a similar nature. A system in which
some target objects are novels and other target objects are
movies can discover such a correlation and exploit it in order
to group particular novels with particular movies, e.g., for
clustering purposes, or to recommend the movies to a user
who has demonstrated interest in the novels. Similarly, if
users who exhibit an interest in certain World Wide Web
sites also exhibit an interest in certain products, the system
can match the products with the sites and thereby recom-
mend to the marketers of those products that they place
advertisements at those sites, e.g., in the form of hypertext
links to their own sites.

The ability to measure the similarity of profiles describing
target objects and a user's interests can be applied in two
basic ways: filtering and browsing. Filtering is useful when
large numbers of target objects are described in the elec-
tronic media space. These target objects can for example be
articles that are received or potentially received by a user,
who only has time to read a small fraction of them. For
example, one might potentially receive all items on the AP
news wire service, all items posted to a number of news
groups, all advertisements in a set of newspapers, or all
unsolicited electronic mail, but few people have the time or
inclination to read so many articles. A filtering system in the
system for customized electronic identification of desirable
objects automatically selects a set of articles that the user is
likely to wish to read. The accuracy of this filtering system
improves over time by noting which articles the user reads
and by generating a measurement of the depth to which the
user reads each article. This information is then used to
update the user's target profile interest summary. Browsing
provides an alternate method of selecting a small subset of
a large number of target objects, such as articles. Articles are
organized so that users can actively navigate among groups
of articles by moving from one group to a larger, more
general group, to a smaller, more specific group, or to a
closely related group. Each individual article forms a one-
member group of its own, so that the user can navigate to
and from individual articles as well as larger groups. The
methods used by the system for customized electronic
identification of desirable objects allow articles to be
grouped into clusters and the clusters to be grouped and
merged into larger and larger clusters. These hierarchies of
clusters then form the basis for menuing and navigational
systems to allow the rapid searching of large numbers of
articles. This same clustering technique is applicable to any
type of target objects that can be profiled on the electronic
media.

There are a number of variations on the theme of devel-
oping and using profiles for article retrieval, with the basic
implementation of an on-line news clipping service repre-
senting the preferred embodiment of the invention. Varia-
tions of this basic system are disclosed and comprise a
system to filter electronic mail, an extension for retrieval of
target objects such as purchasable items which may have
more complex descriptions, a system to automatically build
and alter menuing systems for browsing and searching
through large numbers of target objects, and a system to
construct virtual communities of people with common inter-
ests. These intelligent filters and browsers are necessary to
provide a truly passive, intelligent system interface. A user
interface that permits intuitive browsing and filtering repre-
sents for the first time an intelligent system for determining
the affinities between users and target objects. The detailed,
comprehensive target profiles and user-specific target profile
interest summaries enable the system to provide responsive
routing of specific queries for user information access. The
information maps so produced and the application of users'
target profile interest summaries to predict the information
consumption patterns of a user allows for pre-caching of
data at locations on the data communication network and at
times that minimize the traffic flow in the communication
network to thereby efficiently provide the desired informa-
tion to the user and/or conserve valuable storage space by
only storing those target objects (or segments thereof) which
are relevant to the user's interests.

BRIEF DESCRIPTION OF THE DRAWING

FIG. 1 illustrates in block diagram form a typical archi-
tecture [illegible] system for customized electronic
identification of desirable objects of the present invention
can be implemented as part of a user [illegible];

FIG. 2 illustrates in block diagram form one embodiment
of the system for customized electronic identification of
desirable objects;

FIGS. 3 and 4 illustrate typical network trees;

FIG. 5 illustrates in flow diagram form a method for
automatically generating article profiles and an associated
hierarchical menu system;

FIGS. 6-9 illustrate examples of menu generating
process;

FIG. 10 illustrates in flow diagram form the operational
[illegible] customized electronic identification of
desirable objects [illegible];

FIG. 11 illustrates a hierarchical cluster tree example;

FIG. 12 illustrates in flow diagram form the process for
determination of likelihood of interest by a specific user in
a selected target object;

FIGS. 13A-B illustrate in flow diagram form the auto-
matic clustering process;

FIG. 14 illustrates in flow diagram form the use of the
pseudonymous server;

FIG. 15 illustrates in flow diagram form the use of the
system for accessing information in response to a user
query; and

FIG. 16 illustrates in flow diagram form the use of the
system for accessing information in response to a user query
when the system is a distributed network implementation.

DETAILED DESCRIPTION

MEASURING SIMILARITY

This section describes a general procedure for automati-
cally measuring the similarity between two target objects, or,
more precisely, between target profiles that are automatically
generated for each of the two target objects. This similarity
determination process is applicable to target objects in a
wide variety of contexts. Target objects being compared can
be, as an example but not limited to: textual documents,
human beings, movies, or mutual funds. It is assumed that
the target profiles which describe the target objects are
stored at one or more locations in a data communication
network on data storage media associated with a computer
system. The computed similarity measurements serve as
input to additional processes, which function to enable
human users to locate desired target objects using a large
computer system. These additional processes estimate a
human user's interest in various target objects, or else cluster
a plurality of target objects into logically coherent groups.
The methods used by these additional processes might in
principle be implemented on either a single computer or on
a computer network. Jointly or separately, they form the
underpinning for various sorts of database systems and
information retrieval systems.

Target Objects and Attributes

In classical Information Retrieval (IR) technology, the
user is a literate human and the target objects in question are
textual documents stored on data storage devices intercon-
nected to the user via a computer network. That is, the target
objects consist entirely of text, and so are digitally stored on
the data storage devices within the computer network.
However, there are other target object domains that present
related retrieval problems that are not capable of being
solved by present information retrieval technology which
are applicable to targeting of articles and advertisements to
readers of an on-line newspaper:

(a.) the user is a film buff and the target objects are movies
available on videotape.

(b.) the user is a consumer and the target objects are used
cars being sold.

(c.) the user is a consumer and the target objects are
products being sold through promotional deals.

(d.) the user is an investor and the target objects are
publicly traded stocks, mutual funds and/or real estate
properties.

(e.) the user is a student and the target objects are classes
being offered.

(f.) the user is an activist and the target objects are
Congressional bills of potential concern.

(g.) the user is a net-surfer and the target objects are links
to pages, servers, or news groups available on the
World Wide Web which are linked from pages and
articles of an on-line newspaper.

(h.) the user is a philanthropist and the target objects are

where the system for customized electronic identification of
desirable objects is activated to identify selection of interest,
a particular category of on-line products for review or
purchase by the user, it can be appreciated that there are
certain unique sets of attributes which are pertinent to the
particular product category of choice. For the application as
part of a movie critic column (where the system identifies
movie titles and reviews which are most interesting to the
users), the system is likely to be concerned with values of
attributes such as these:

(a.) title of movie,

(b.) name of director,

(c.) Motion Picture Association of America (MPAA)
child-appropriateness rating (0=G, 1=PG, . . . ),

(d.) date of release,

(e.) number of stars granted by a particular critic,

(f.) number of stars granted by a second critic,

(g.) number of stars granted by a third critic,

For example, a customized financial news column may be
presented to the user in the form of articles which are of
interest to the user. In this case, however, an accordingly
those stocks which are most interesting to the user may be
presented as well.

(h.) full text of review by the third critic,

(i.) list of customers who have previously rented this
movie,

(j.) list of actors.

Each movie has a different set of values for these
attributes. This example conveniently illustrates three kinds
of attributes. Attributes c-g are numeric attributes, of the sort
that might be found in a database record. It is evident that
they can be used to help the user identify target objects
(movies) of interest. For example, the user might previously

charities ^^^^ rented many Parental Guidance (PG) films, and many 

(i.) the user is Ul and the target objects are ads for medical "^^^e in the 1970's. TTiis geDeralization is useful: new 

' y t films with values for one or both attributes that are numen- 

.... , , . . i_- . cally similar to these (such as MPAA rating of 1, release date 

0.) the user is an employee and the target objects are ^ (^^^^ .^^ ^^ ^.^^^^^ ^^^^ 

classifieds for potenUal employers. ^^^^^ ^^^^^^^^^ ^^^^^^^ ^^^^^^ Attributes a-b and 
(k.) the user is an employer and the target objects are ^ ^^^tual attributes. They too are important for helping 
classifieds for potential employees. ^^^j. i^^^^^ desired films. For example, perhaps the user 
(I.) the user is a lonely heart and the target objects are has shown a past interest in films whose review text 
classifies for potential conversation partners. 45 (attribute h) contains words like "chase,** "explosion," 
(m.) the user is in search of an expert and the target "explosions," "hero,** "gripping," and "superb." This gen- 
objects are users, with known retrieval habits, of an erahzation is again useful in identifying new films of inter- 
document retrieval system. est. Attribute i is an associative attribute. It records associa- 
(n.) the user is in need of insurance and the target objects tions between the target objects in this domain, namely 
are classifieds for insurance policy offers. 50 movies, and ancillary target objects of an entirely different 
In all these cases, the user wishes to locate some small sort, namely humans. A good indication that the user wants 
subset of the target objects — such as the target objects that to rent a particular movie is that the user has previously 
the user most desires to rent, buy, investigate, meet, read, rented other movies with similar attribute values, and this 
give mammograms to, insure, and so forth. The task is to holds for attribute I just as it does for attributes a-h. For 
help the, user identify the most interesting target objects, 55 example, if the user has often liked movies that customer 
where the user's interest in a target object is defined to be a C^, and customer C^g^ have rented, then the user may like 
numerical measurement of the user's relative desire to locate other such movies, which have similar values for attribute i. 
that object rather than others. Attribute j is another example of an associative attribute, 
The generality of this problem motivates a general recording associations between target objects and actors, 
approach to solving the information retrieval problems noted 60 Notice that any of these attributes can be made subject to 
above. It is assumed that many target objects are known to authenticatton when the profile is constructed, through the 
the system for customized electronic identification of desir- use of digital signatures; for example, the target object could 
able objects, and that specifically, the system stores (or has be accompanied by a digitally signed note from the MPAA, 
the abihty to reconstruct) several pieces of information which note names the target object and specifies its authentic 
about each target object. These pieces of information are 65 value for attribute c. 

termed "attributes**: collectively, they arc said to form a Hicsc three kinds of attributes are common: numeric, 

profile of the target object, or a "target profile." For example, textual, and associative . In the classical information retrieval 
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problem, where the target objects are documents (or more (c.) number of employees during each of the last 10 years 
generally, coherent document sections extracted by a text (ten separate numeric attributes), 

segmentation method), the system might only consider a (d.) percentage growth in number of employees during 
single, textual attribute when measuring similarity: the fill each of the last 10 years, 

text of the target object. However, a more sophisticated 5 (c.) dividend payment issued in each of the last 40 
system would consider a longer target profile, including quarters, as a percentage of current share price, 

numeric and associative attributes: (f.) percentage appreciation of stock value during each of 

(a.) full text of document (textual), the last 40 quarters, list of shareholders (associative), 

(b ) title (textual composite text of recent articles about the corporation 

J 10 in the financial press (textual), 

(c.) author (textual), ^ no^ng some additional attributes that are of 

(d.) language in which document is written (textual), interest in some domnains. In the case of documents and 

(e.) date of creation (numeric), certain other domains, it is useful to know the source of eacb 

(f.) date of last update (numeric), target object (for example, refeneed journal article vs. UPI 

(g.) length in words (numeric), news wire article vs. Usenet newsgroup posting vs. 

\ question-answer pair from a qucstion-and-answer list vs. 

(n.) re aamg level (numeric), 7.,.j ^-i . 

^. / ^ , ' . , . „ . , tabloid newspaper article vs. . . . ); the source may be 

(i.) quality of docunaent as rated by a thirdUarty editorial ..presented as a single-term textual attribute. Important 

agency (numeric), associative attributes for a hypertext document are the list of 

(j.) hst of other readers who have retrieved this document documents that it Unks to, and the list of documents that link 

(associative). to it. Documents with similar citations are similar with 

As another domain example, consider a domain where the respect to the former attribute, and documents that are cited 
user is an advertiser and the target objects are potential the same places are similar with respect to the latter. A 

customers. The system might store the following attributes convention may optionally be adopted that any document 
for each target object (potential customer): 25 also links to itself. Especially in systems where users can 

(a.) first two digits of zip code (textual), choose whether or not to retrieve a target object, a target 

(b.) first three digits of zip code (textual), object's popularity (or circulation) can be usefully measured 

(c.) entire five-digit zip code (textual), as a numeric attribute specifying the number of users who 

(d.) distance of residence from advertiser's nearest physi- ^^^^ '^^^eved that object. Related measurable numeric 

cal storefront (numeric) attributes that also mdicate a kmd of popularity mclude the 

, V If.,. \ • \ number of replies to a target object, in the domain where 

(e.) annual lamily income (numeric), , * l- . * j * i * - 

^ ^ ^ \ /J target objects are messages posted to an electronic commu- 

(f.) number of children (numenc), ^^^y ^^^^ computer bulletin board or newsgroup, and 

(g.) list of previous items purchased by this potential the number of links leading to a target object, in the domain 

customer (associative), 35 where target objects are interlinked hypertext documents on 

(h.) list of filenames stored on this potential customer's the World Wide Web or a similar system. A target object may 

client computer (associative), also receive explicit numeric evaluations (another kind of 

(i.) list of movies rented by this potential customer numeric attribute) from various groups, such as the Motion 

(associative), Picture-Association of America (MPAA), as above, which 

(j.) list of investments in this potential customer's invest- rates movies' appropriateness for children, or the American 

ment portfolio (associative) Medical Association, which might rate the accuracy and 

(k.) Ust of documents retrieved by this potential customer ""^^l'^ f '^"'^f^ '^^^'^''^^ P^P'="' °' random survey 

/ • . . X sample of users (chosen from all users or a selected set of 

... . , „ , t . . . i . . / . ix experts), who could be asked to rate nearly anything. Certain 

(1.) written response to Rorschach mkblot test (textual), ^^^^^ (^^^^ evaluation, which also yield numeric 

(m.) multiple-choice responses by this customer to 20 attributes, may be carried out mechanically For example, 
self-image questions (20 textual attributes), difficulty of reading a text can be assessed by standard 

As always, the notion is that similar consumers buy procedures that count word and sentence lengths, while the 

smiilar products. It should be noted that diverse: sorts of vulgarity of a text could be defined as (say) the number of 
information are being used here to characterize consumers, 50 vulgar words it contains, and the expertise of a text could be 

from their consumption patterns to their literary taste s and cmdely assessed by counting the number of similar texts its 

psychological peculiarities, and that this fact illustrates both author had previously retrieved and read using the invention, 

the flexibUity and power of the system for customized perhaps confining this count to texts that have high approval 

electronic identification of desirable objects of the present j^tingp from critics. Finally, it is possible to synthesize 
invention. Diverse sorts of information can be used as 55 certain textual attributes mechanically, for example to recon- 

attnbutes m other domains as well (as when physical, jtjyct the script of a movie by applying speech recognition 

economic, psychological and interest-related questions are techniques to its soundtrack or by applying optical character 

used to profile the appUcants to a dating service, which is recognition techniques to its closed-caption subtitles, 

indeed a possible domain for the present system), and the Decomposing Complex Attributes 

advertiser domain is simply an example. Although textual and associative attributes are large and 

As a final domam example, consider a domam where the complex pieces of data, for information retrieval purposes 

user is an stock market investor and the target objects are ^jj^y can be decomposed into smaller, simpler numeric 

publicly traded corporations. Agreat many attributes might attributes. This means that any set of attributes can be 

be used to characterize each corporation, including but not replaced by a (usuaUy larger) set of numeric attributes, and 
Umited to the following: ^5 ^j^^t prolile be represented as a vector of 

(a.) type of business (textual), numbers denoting the values of these numeric attributes. In 

(b.) corporate mission statement (textual), particular, a textual attribute, such as the fill text of a movie 
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review, can be replaced by a collection of numeric attributes 
that represent scores to denote the presence and significance 
of the words "aardvark," "aback," "abacus," and so on 
through "zymurgy" in that text. The score of a word in a text 
may be defined in numerous ways. The simplest definition is 5 
that the score is the rate of the word in the text, which is 
computed by computing the number of times the word 
occurs in the text, an d dividing this number by the total 
number of words in the text. This sort of score is often called 
the "term frequency" (TF) of the word. The definition of 10 
term frequency may optionally be modified to weight dif- 
ferent portions of the text unequally: for example, any 
occurrence of a word in the text*s title might be counted as 
a 3 -fold or more generally k-fold occurrence (as if the title 
had been repeated k times within the text), in order to reflect 15 
a heuristic assumption that the words in the title are par- 
ticularly important indicators of the text's content or topic. 

However, for lengthy textual attributes, such as the text of
an entire document, the score of a word is typically defined
to be not merely its term frequency, but its term frequency
multiplied by the negated logarithm of the word's "global
frequency," as measured with respect to the textual attribute
in question. The global frequency of a word, which effectively
measures the word's uninformativeness, is a fraction
between 0 and 1, defined to be the fraction of all target
objects for which the textual attribute in question contains
this word. This adjusted score is often known in the art as
TF/IDF ("term frequency times inverse document
frequency"). When global frequency of a word is taken into
account in this way, the common, uninformative words have
scores comparatively close to zero, no matter how often or
rarely they appear in the text. Thus, their rate has little
influence on the object's target profile. Alternative methods
of calculating word scores include latent semantic indexing
or probabilistic models.
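The word scores described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names and the sample reviews are hypothetical, natural logarithms are assumed, and the scored text is assumed to be one of the texts over which global frequency is measured.

```python
import math
from collections import Counter


def term_frequency(body_words, title_words=(), k=1):
    """Rate of each word, with the title counted as if repeated k times."""
    tokens = list(body_words) + list(title_words) * k
    counts = Counter(tokens)
    return {word: n / len(tokens) for word, n in counts.items()}


def global_frequency(word, texts):
    """Fraction of texts (each a collection of words) that contain the word."""
    return sum(1 for text in texts if word in text) / len(texts)


def tf_idf(body_words, all_texts):
    """Term frequency times the negated logarithm of global frequency.

    Assumption: body_words is itself one of all_texts, so every scored
    word has a global frequency greater than zero.
    """
    word_sets = [set(text) for text in all_texts]
    return {
        word: tf * -math.log(global_frequency(word, word_sets))
        for word, tf in term_frequency(body_words).items()
    }


# Hypothetical review texts, already split into words.
reviews = [
    ["the", "chase", "the", "hero"],
    ["the", "quiet", "drama"],
    ["the", "hero", "returns"],
]
scores = tf_idf(reviews[0], reviews)
```

In this sample, "the" occurs in every review, so its adjusted score is zero even though it is the most frequent word in the first review, while "chase", which occurs in only one review, receives the highest score.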

Instead of breaking the text into its component words, one
could alternatively break the text into overlapping word
bigrams (sequences of 2 adjacent words), or more generally,
word n-grams. These word n-grams may be scored in the
same way as individual words. Another possibility is to use
character n-grams. For example, this sentence contains a
sequence of overlapping character 5-grams which starts "for
e", "or ex", "r exa", " exam", "examp", etc. The sentence
may be characterized, imprecisely but usefully, by the score
of each possible character 5-gram ("aaaaa", "aaaab", . . . ,
"zzzzz") in the sentence. Conceptually speaking, in the
character 5-gram case, the textual attribute would be decomposed
into at least 26^5 = 11,881,376 numeric attributes. Of
course, for a given target object, most of these numeric
attributes have values of 0, since most 5-grams do not appear
in the target object attributes. These zero values need not be
stored anywhere. For purposes of digital storage, the value
of a textual attribute could be characterized by storing the set
of character 5-grams that actually do appear in the text,
together with the nonzero score of each one. Any 5-gram
that is not included in the set can be assumed to have a score
of zero. The decomposition of textual attributes is not
limited to attributes whose values are expected to be long
texts. A simple, one-term textual attribute can be replaced by
a collection of numeric attributes in exactly the same way.
Consider again the case where the target objects are movies.
The "name of director" attribute, which is textual, can be
replaced by numeric attributes giving the scores for
"Federico-Fellini," "Woody-Allen," "Terence-Davies," and
so forth, in that attribute. For these one-term textual
attributes, the score of a word is usually defined to be its rate
in the text, without any consideration of global frequency.
Note that under these conditions, one of the scores is 1,
while the other scores are 0 and need not be stored. For
example, if Davies did direct the film, then it is
"Terence-Davies" whose score is 1, since "Terence-Davies"
constitutes 100% of the words in the textual value of the "name of
director" attribute. It might seem that nothing has been
gained over simply regarding the textual attribute as having
the string value "Terence-Davies." However, the trick of
decomposing every non-numeric attribute into a collection
of numeric attributes proves useful for the clustering and
decision tree methods described later, which require the
attribute values of different objects to be averaged and/or
ordinally ranked. Only numeric attributes can be averaged or
ranked in this way.
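The sparse storage of character n-gram scores described above might look as follows in Python. This is a minimal sketch: the function names are hypothetical, the text is lower-cased as an assumption made here, and the score of an n-gram is taken to be its rate among all n-grams of the text.

```python
from collections import Counter


def char_ngram_scores(text, n=5):
    """Map each character n-gram that occurs in the text to its rate.

    Only n-grams that actually occur are stored; any n-gram absent from
    the returned dict is understood to have a score of zero.
    """
    text = text.lower()
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(grams)
    return {gram: c / len(grams) for gram, c in counts.items()}


def one_term_scores(value):
    """A one-term textual attribute: its single term has a score of 1."""
    return {value: 1.0}


sentence_scores = char_ngram_scores("For example")
director_scores = one_term_scores("Terence-Davies")
```

A lookup such as `sentence_scores.get("zzzzz", 0.0)` returns zero without that entry ever being stored, which is the economy of storage the text refers to.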

Just as a textual attribute may be decomposed into a
number of component terms (letter or word n-grams), an
associative attribute may be decomposed into a number of
component associations. For instance, in a domain where the
target objects are movies, a typical associative attribute used
in profiling a movie would be a list of customers who have
rented that movie. This list can be replaced by a collection
of numeric attributes, which give the "association scores"
between the movie and each of the customers known to the
system. For example, the 165th such numeric attribute
would be the association score between the movie and
customer #165, where the association score is defined to be
1 if customer #165 has previously rented the movie, and 0
otherwise. In a subtler refinement, this association score
could be defined to be the degree of interest, possibly zero,
that customer #165 exhibited in the movie, as determined by
relevance feedback (as described below). As another
example, in a domain where target objects are companies,
an associative attribute indicating the major shareholders of
the company would be decomposed into a collection of
association scores, each of which would indicate the percentage
of the company (possibly zero) owned by some
particular individual or corporate body. Just as with the term
scores used in decomposing lengthy textual attributes, each
association score may optionally be adjusted by a multiplicative
factor: for example, the association score between a
movie and customer #165 might be multiplied by the
negated logarithm of the "global frequency" of customer
#165, i.e., the fraction of all movies that have been rented by
customer #165. Just as with the term scores used in decomposing
textual attributes, most association scores found
when decomposing a particular value of an associative
attribute are zero, and a similar economy of storage may be
gained in exactly the same manner by storing a list of only
those ancillary objects with which the target object has a
nonzero association score, together with their respective
association scores.
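A brief Python sketch of this decomposition is given below. The data layout (a dict from movie to the set of customers who rented it), the function name, and the sample rentals are hypothetical; the optional adjustment uses the natural logarithm.

```python
import math


def association_scores(renters_by_movie, movie, adjust=False):
    """Sparse association scores between one movie and its renters.

    A customer who rented the movie scores 1. With adjust=True the score
    is multiplied by the negated log of the customer's global frequency,
    i.e. the fraction of all movies that this customer has rented.
    Zero scores are not stored.
    """
    n_movies = len(renters_by_movie)
    scores = {}
    for customer in renters_by_movie[movie]:
        score = 1.0
        if adjust:
            rented = sum(1 for r in renters_by_movie.values() if customer in r)
            score *= -math.log(rented / n_movies)
        if score != 0:
            scores[customer] = score
    return scores


# Hypothetical rental records: movie title -> set of customer numbers.
rentals = {"A": {165, 7}, "B": {165}, "C": {165, 9}}
plain = association_scores(rentals, "A")
adjusted = association_scores(rentals, "A", adjust=True)
```

In this sample, customer #165 has rented every movie, so after the adjustment that customer's association with movie "A" carries no information and is dropped from the stored list.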
Similarity Measures 

What does it mean for two target objects to be similar?
More precisely, how should one measure the degree of
similarity? Many approaches are possible and any reasonable
metric that can be computed over the set of target object
profiles can be used, where target objects are considered to
be similar if the distance between their profiles is small
according to this metric. Thus, the following preferred
embodiment of a target object similarity measurement system
has many variations.

First, define the distance between two values of a given
attribute according to whether the attribute is a numeric,
associative, or textual attribute. If the attribute is numeric,
then the distance between two values of the attribute is the
absolute value of the difference between the two values.
(Other definitions are also possible: for example, the distance
between prices p1 and p2 might be defined by
|p1-p2|/(max(p1,p2)+1), to recognize that when it comes to
customer interest, $5000 and $5020 are very similar,
whereas $3 and $23 are not.) If the attribute is associative,
then its value V may be decomposed as described above into
a collection of real numbers, representing the association
scores between the target object in question and various
ancillary objects. V may therefore be regarded as a vector
with components V1, V2, V3, etc., representing the association
scores between the object and ancillary objects 1, 2, 3,
etc., respectively. The distance between two vector values V
and U of an associative attribute is then computed using the
angle distance measure, arccos(VU'/sqrt((VV')(UU'))).
(Note that the three inner products in this expression have
the form XY' = X1 Y1 + X2 Y2 + X3 Y3 + . . . , and that for
efficient computation, terms of the form Xi Yi may be
omitted from this sum if either of the scores Xi and Yi is
zero.) Finally, if the attribute is textual, then its value V may
be decomposed as described above into a collection of real
numbers, representing the scores of various word n-grams or
character n-grams in the text. Then the value V may again
be regarded as a vector, and the distance between two values
is again defined via the angle distance measure. Other
similarity metrics between two vectors, such as the dice
measure, may be used instead. It happens that the obvious
alternative metric, Euclidean distance, does not work well:
even similar texts tend not to overlap substantially in the
content words they use, so that texts encountered in practice
are all substantially orthogonal to each other, assuming that
TF/IDF scores are used to reduce the influence of non-content
words. The scores of two words in a textual attribute
vector may be correlated; for example, "Kennedy" and
"JFK" tend to appear in the same documents.
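The per-attribute distances just defined can be sketched in Python as follows. The function names are hypothetical, vectors are held as sparse dicts of nonzero scores, both vectors are assumed to be nonzero, and the angle is returned in radians.

```python
import math


def numeric_distance(a, b):
    """Absolute difference between two numeric attribute values."""
    return abs(a - b)


def price_distance(p1, p2):
    """The alternative relative definition given for prices."""
    return abs(p1 - p2) / (max(p1, p2) + 1)


def angle_distance(v, u):
    """arccos(VU' / sqrt((VV')(UU'))) for sparse vectors stored as dicts.

    Only keys present in both dicts contribute to the inner product VU',
    since a term is zero whenever either score is zero.
    Assumption: neither vector is all zeros.
    """
    dot = sum(x * u[key] for key, x in v.items() if key in u)
    norm = math.sqrt(sum(x * x for x in v.values()) *
                     sum(y * y for y in u.values()))
    cosine = max(-1.0, min(1.0, dot / norm))  # guard against rounding
    return math.acos(cosine)


close_prices = price_distance(5000, 5020)
far_prices = price_distance(3, 23)
same_direction = angle_distance({"chase": 1.0}, {"chase": 2.0})
no_overlap = angle_distance({"chase": 1.0}, {"drama": 1.0})
```

Here the two large prices come out far closer to each other than the two small ones, and two texts that share no scored terms are a right angle apart regardless of their lengths.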

Thus it may be advisable to alter the text somewhat before
computing the scores of terms in the text, by using a
synonym dictionary that groups together similar words. The
effect of this optional pre-alteration is that two texts using
related words are measured to be as similar as if they had
actually used the same words. One technique is to augment
the set of words actually found in the article with a set of
synonyms or other words which tend to co-occur with the
words in the article, so that "Kennedy" could be added to
every article that mentions "JFK." Alternatively, words
found in the article may be wholly replaced by synonyms, so
that "JFK" might be replaced by "Kennedy" or by "John F.
Kennedy" wherever it appears. In either case, the result is
that documents about Kennedy and documents about JFK
are adjudged similar. The synonym dictionary may be sensitive
to the topic of the document as a whole; for example,
it may recognize that "crane" is likely to have a different
synonym in a document that mentions birds than in a
document that mentions construction. A related technique is
to replace each word by its morphological stem, so that
"staple", "stapler", and "staples" are all replaced by "staple."
Common function words ("a", "and", "the" . . . ) can
influence the calculated similarity of texts without regard to
their topics, and so are typically removed from the text
before the scores of terms in the text are computed. A more
general approach to recognizing synonyms is to use a
revised measure of the distance between textual attribute
vectors V and U, namely arccos(AV(AU)'/sqrt(AV(AV)'
AU(AU)')), where the matrix A is the dimensionality-reducing
linear transformation (or an approximation thereto)
determined by collecting the vector values of the textual
attribute, for all target objects known to the system, and
applying singular value decomposition to the resulting collection.
The same approach can be applied to the vector
values of associative attributes. The above definitions allow
us to determine how close together two target objects are
with respect to a single attribute, whether numeric,
associative, or textual. The distance between two target
objects X and Y with respect to their entire multi-attribute
profiles P_X and P_Y is then denoted d(X,Y) or d(P_X, P_Y) and
defined as:

(((distance with respect to attribute a)(weight of attribute a))^k
+ ((distance with respect to attribute b)(weight of attribute b))^k
+ ((distance with respect to attribute c)(weight of attribute c))^k
+ . . . )^(1/k)

where k is a fixed positive real number, typically 2, and the
weights are non-negative real numbers indicating the relative
importance of the various attributes. For example, if the
target objects are consumer goods, and the weight of the
"color" attribute is comparatively very small, then color is
not a consideration in determining similarity: a user who
likes a brown massage cushion is predicted to show equal
interest in the same cushion manufactured in blue, and
vice-versa. On the other hand, if the weight of the "color"
attribute is comparatively very high, then users are predicted
to show interest primarily in products whose colors they
have liked in the past: a brown massage cushion and a blue
massage cushion are not at all the same kind of target object,
however similar in other attributes, and a good experience
with one does not by itself inspire much interest in the other.
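The weighted combination of attribute distances can be sketched in Python as below. This is an illustration only: the function name and the sample numbers are hypothetical, and the sketch takes the k-th root of the sum (the usual Minkowski form), which is an assumption made here and does not affect which of two pairs of objects is judged closer.

```python
def profile_distance(attribute_distances, weights, k=2):
    """Combine per-attribute distances into one profile distance.

    attribute_distances and weights are dicts keyed by attribute name.
    Assumption: the k-th root of the weighted sum is taken.
    """
    total = sum((attribute_distances[a] * weights[a]) ** k
                for a in attribute_distances)
    return total ** (1.0 / k)


# Hypothetical per-attribute distances between two massage cushions.
distances = {"price": 3.0, "color": 4.0}
color_matters = profile_distance(distances, {"price": 1.0, "color": 1.0})
color_ignored = profile_distance(distances, {"price": 1.0, "color": 0.0})
```

With the weight of "color" set to zero, the two cushions differ only by their price distance, which mirrors the brown and blue cushion example.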
Target objects may be of various sorts, and it is sometimes
advantageous to use a single system that is able to compare
target objects of distinct sorts. For example, in a system
where some target objects are novels while other target
objects are movies, it is desirable to judge a novel and
a movie similar if their profiles show that similar users like
them (an associative attribute). However, it is important to
note that certain attributes specified in the movie's target
profile are undefined in the novel's target profile, and vice
versa: a novel has no "cast list" associative attribute and a
movie has no "reading level" numeric attribute. In general,
a system in which target objects fall into distinct sorts may
sometimes have to measure the similarity of two target
objects for which somewhat different sets of attributes are
defined. This requires an extension to the distance metric
d(*,*) defined above. In certain applications, it is sufficient
when carrying out such a comparison simply to disregard
attributes that are not defined for both target objects: this
allows a cluster of novels to be matched with the most
similar cluster of movies, for example, by considering only
those attributes that novels and movies have in common.
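The simple approach of disregarding attributes that are not defined for both target objects might be sketched as follows in Python. The profile layout (a dict from attribute name to value), the function name, the sample attributes, and the use of a k-th root are all assumptions made for illustration.

```python
def shared_attribute_distance(profile_x, profile_y, distance_fns, weights, k=2):
    """Distance over only those attributes defined in both profiles.

    distance_fns maps an attribute name to a function of two values.
    If the profiles share no attributes, the result is 0.0, so callers
    may want to treat that case separately.
    """
    shared = [a for a in profile_x if a in profile_y]
    total = sum((distance_fns[a](profile_x[a], profile_y[a]) * weights[a]) ** k
                for a in shared)
    return total ** (1.0 / k)


# Hypothetical profiles: "reading_level" and "runtime" are each defined
# for only one sort of target object and are therefore disregarded.
novel = {"year": 1990, "reading_level": 8}
movie = {"year": 1993, "runtime": 120}
fns = {"year": lambda a, b: abs(a - b)}
d = shared_attribute_distance(novel, movie, fns, {"year": 1.0})
```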

However, while this method allows comparisons between
(say) novels and movies, it does not define a proper metric
over the combined space of novels and movies and therefore
does not allow clustering to be applied to the set of all target
objects. When necessary for clustering or other purposes, a
metric that allows comparison of any two target objects
(whether of the same or different sorts) can be defined as
follows. If a is an attribute, then let Max(a) be an upper
bound on the distance between two values of attribute a;
notice that if attribute a is an associative or textual attribute,
this distance is an angle determined by arccos, so that
Max(a) may be chosen to be 180 degrees, while if attribute
a is a numeric attribute, a sufficiently large number must be
selected by the system designers. The distance between two
values of attribute a is given as before in the case where both
values are defined; the distance between two undefined
values is taken to be zero; finally, the distance between a
defined value and an undefined value is always taken to be
Max(a)/2. This allows us to determine how close together
two target objects are with respect to an attribute a, even if
attribute a does not have a defined value for both target
objects. The distance d(*,*) between two target objects with
respect to their entire multi-attribute profiles is then given in
terms of these individual attribute distances exactly as
before. It is assumed that one attribute in such a system
specifies the sort of target object ("movie", "novel", etc.),
and that this attribute may be highly weighted if target
objects of different sorts are considered to be very different
despite any attributes they may have in common.

UTILIZING THE SIMILARITY MEASUREMENT

Matching Buyers and Sellers

A simple application of the similarity measurement is a
system to match buyers with sellers in small-volume
markets, such as used cars and other used goods, artwork, or
employment. Sellers submit profiles of the goods (target
objects) they want to sell, and buyers submit profiles of the
goods (target objects) they want to buy. Participants may
submit or withdraw these profiles at any time. The system
for customized electronic identification of desirable objects
computes the similarities between seller-submitted profiles
and buyer-submitted profiles, and when two profiles match
closely (i.e., the similarity is above a threshold), the corresponding
seller and buyer are notified of each other's
identities. To prevent users from being flooded with
responses, it may be desirable to limit the number of
notifications each user receives to a fixed number, such as
[illegible].

Filtering: Relevance Feedback

A filtering system is a device that can search through
many target objects and estimate a given user's interest in
each target object, so as to identify those that are of greatest
interest to the user. The filtering system uses relevance feedback
to refine its knowledge of the user's interests: whenever
the filtering system identifies a target object as potentially
interesting to a user, the user (if an on-line user) provides
feedback as to whether or not that target object really is of
interest. Such feedback is stored long-term in summarized
form, as part of a database of user feedback information, and
may be provided either actively or passively. In active
feedback, the user explicitly indicates his or her interest, for
instance, on a scale of -2 (active distaste) through 0 (no
special interest) to 10 (great interest). In passive feedback,
the system infers the user's interest from the user's behavior.

potential sources of passive feedback include an electronic
measurement of the extent to which the user's pupils dilate
while the user views the target object or a description of the
target object. It is possible to combine active and passive
feedback. One option is to take a weighted average of the
two ratings. Another option is to use passive feedback by
default, but to allow the user to examine and actively modify
the passive feedback score. In the scenario above, for
instance, an uninteresting article may sometimes remain on
[illegible]
in unrelated business; the passive feedback score is then
inappropriately high, and the user may wish to correct it
before continuing. In the preferred embodiment of the
invention, a visual indicator, such as a sliding bar or indicator
needle on the user's screen, can be used to continuously
[illegible]
system for the target object being viewed, unless the user has
manually adjusted the indicator by a mouse operation or
other means in order to reflect a different score for this target
[illegible]
Regardless how a user's feedback is computed, it is stored
long-term as part of that user's target profile interest summary.

Filtering: Determining Topical Interest Through Similarity

Relevance feedback only determines the user's interest in
certain target objects: namely, the target objects that the user
has actually had the opportunity to evaluate (whether
actively or passively). For target objects that the user has not
yet seen, the filtering system must estimate the user's
interest. This estimation task is the heart of the filtering
[illegible] measurement is
[illegible] concretely, the preferred embodiment of the
filtering system is a news clipping service that periodically
presents the user with news articles of potential interest. The
user provides active and/or passive feedback to the system
[illegible] presented articles. However, the system
[illegible] information from the user for
[illegible]

For example if tajget object are textual documents, the similarly, in the dating service domain where target objects 

system might monitor which documents the user chooses to ^ „„•„ „„«„„,o .u. 

' J . J I u L .1. J are prospective romantic partners, the system has only 

read, or not to read, and how much time the user spends • j jl , u ^ . .• 

J. • • . . • received feedback on old flames, not on prospective new 

reading them. A typical formula for asscssmg mterest in a joves 

document via P^ive feedback, in this domain, on a scale of ""^'shown in flow diagram form in FIG. 12, the, evalua- 

° . ' f^*^ - • J ^^^^ of likelihood of interest in a particular target object 

+2 if the second: page is viewed, ^ ^p^^^^^ automaticaUy be computed. The 

+2 if all pages are viewed, interest that a given target object X holds for a user U is 

+2 if more than 30 seconds was spent viewing the assumed to be a sum of two quantities: q(U, X), the intrinsic 

document, "quaUty" of X plus f(U, X), the "topical interest" that users 
+2 if more-than one minute was spent viewing the 55 like U have in target objects like X. For any target object X, 

document, the intrinsic quality measure q(U, X) is easily estimated at 

+2 if the minutes spent viewing the document are greater steps 1201-1203 directly from numeric attributes of the 

than half the number of pages. target object X. The computation process begins at step 

If the target objects are electronic mail messages, interest 1201, where certain designated numeric attributes of target 
points might also be added in the case of a particularly 60 object X are specifically selected, which attributes by their 

lengthy or particularly prompt reply. If the target objects are very nature should be positively or negatively correlated 

purchasable goods, interest points might be added for target with users* interest. Such attributes, termed "quality 

objects that the user actually purchases, with further points attributes," have the normative property that the higher (or 

in the case of a large -quantity or high -price purchase. In any in some cases lower) their value, the more interesting a user 
domain, further points might be added for target objects that 65 is expected to find them. Quality attributes of target object 

the user accesses early in a session, on the grounds that users X may include, but are not limited to, target object X's 

access the object s that most interest them first. Other popularity among users in general, the rating a particular 
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reviewer has given target object X, the age (time since 
authorship, also known as outdatedness) of target object X,
the number of vulgar words used in target object X, the price
of target object X, and the amount of money that the
company selling target object X has donated to the user's
favorite charity. At step 1202, each of the selected attributes
is multiplied by a positive or negative weight indicative of
the strength of user U's preference for those target objects
that have high values for this attribute, which weight must
be retrieved from a data file storing quality attribute weights
for the selected user. At step 1203, a weighted sum of the
identified weighted selected attributes is computed to
determine the intrinsic quality measure q(U, X). At step 1204,
the summarized weighted relevance feedback data is retrieved,
wherein some relevance feedback points are weighted more
heavily than others and the stored relevance data can be
summarized to some degree, for example by the use of
search profile sets. The more difficult part of determining
user U's interest in target object X is to find or compute at
step 1205 the value of f(U, X), which denotes the topical
interest that users like U generally have in target objects like
X. The method of determining a user's interest relies on the
following heuristic: when X and Y are similar target objects
(have similar attributes), and U and V are similar users (have
similar attributes), then topical interest f(U, X) is predicted
to have a similar value to the value of topical interest f(V,
Y). This heuristic leads to an effective method because
estimated values of the topical interest function f(*, *) are
actually known for certain arguments to that function:
specifically, if user V has provided a relevance-feedback
rating of r(V, Y) for target object Y, then insofar as that rating
represents user V's true interest in target object Y, we have
r(V, Y)=q(V, Y)+f(V, Y) and can estimate f(V, Y) as r(V,
Y)-q(V, Y). Thus, the problem of estimating topical interest
at all points becomes a problem of interpolating among these
estimates of topical interest at selected points, such as the
feedback estimate of f(V, Y) as r(V, Y)-q(V, Y). This
interpolation can be accomplished with any standard
smoothing technique, using as input the known point
estimates of the value of the topical interest function f(*, *),
and determining as output a function that approximates the
entire topical interest function f(*, *).
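
As an illustration only, the relationship between a rating r, the quality measure q, and a point estimate of topical interest f can be sketched in Python. The function names, the attribute names, and the reading of "second page viewed" as "at least two pages viewed in order" are assumptions made for this sketch and do not come from the specification.

```python
def passive_rating(pages_viewed, total_pages, seconds_viewed):
    """Passive-feedback rating for a document, +2 per rule in the text."""
    score = 0
    if pages_viewed >= 2:                        # second page was viewed
        score += 2
    if pages_viewed >= total_pages:              # all pages were viewed
        score += 2
    if seconds_viewed > 30:                      # more than 30 seconds
        score += 2
    if seconds_viewed > 60:                      # more than one minute
        score += 2
    if seconds_viewed / 60.0 > total_pages / 2:  # minutes > half the page count
        score += 2
    return score


def quality(attributes, weights):
    """q(U, X): weighted sum of X's quality attributes (steps 1201-1203)."""
    # Attributes without a stored weight contribute nothing (assumption).
    return sum(weights.get(name, 0.0) * value
               for name, value in attributes.items())


def topical_point_estimate(rating, q):
    """Point estimate of f(V, Y), computed as r(V, Y) - q(V, Y)."""
    return rating - q


# Hypothetical example values.
r = passive_rating(pages_viewed=3, total_pages=3, seconds_viewed=120)
q = quality({"popularity": 0.5, "vulgar_words": 3},
            {"popularity": 2.0, "vulgar_words": -0.5})
print(r, q, topical_point_estimate(r, q))
```

The point estimates produced this way are the inputs that a smoothing technique would interpolate; the smoothing step itself is not shown here.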

Not all point estimates of the topical interest function f(*, 
*) should be given equal weight as inputs to the smoothing 
algorithm. Since passive relevance feedback is less reliable
than active relevance feedback, point estimates made from
passive relevance feedback should be weighted less heavily
than point estimates made from active relevance feedback,
or even not used at all. In most domains, a user's interests
may change over time and, therefore, estimates of topical
interest that derive from more recent feedback should also
be weighted more heavily. A user's interests may vary
according to mood, so estimates of topical interest that
derive from the current session should be weighted more
heavily for the duration of the current session, and past
estimates of topical interest made at approximately the
current time of day or on the current weekday should be
weighted more heavily. Finally, in domains where users are
trying to locate target objects of long-term interest
(investments, romantic partners, pen pals, employers,
employees, suppliers, service providers) from the possibly
meager information provided by the target profiles, the users
are usually not in a position to provide reliable immediate
feedback on a target object, but can provide reliable
feedback at a later date. An estimate of topical interest f(V, Y)
should be weighted more heavily if user V has had more
experience with a target object Y. Indeed, a useful strategy
is for the system to track long-term feedback for such target 
objects. For example, if target profile Y was created in 1990 
to describe a particular investment that was available in 
1990, and that was purchased in 1990 by user V, then the 
system solicits relevance feedback from user V in the years 
1990, 1991, 1992, 1993, 1994, 1995, etc., and treats these as 
successively stronger indications of user V's true interest in 
target profile Y, and thus as indications of user V's likely 
interest in new investments whose current profiles resemble 
the original 1990 investment profile Y. In particular, if in
1994 and 1995 user V is well-disposed toward his or her 
1990 purchase of the investment described by target profile 
Y, then in those years and later, the system tends to recom- 
mend additional investments when they have profiles like 
target profile Y, on the grounds that they too will turn out to 
be satisfactory in 4 to 5 years. It makes these recommen- 
dations both to user V and to users whose investment 
portfolios and other attributes are similar to user V's. The
relevance feedback provided by user V in this case may be 
either active (feedback satisfaction ratings provided by the 
investor V) or passive (feedback=difference between aver- 
age annual return of the investment and average annual 
return of the Dow Jones index portfolio since purchase of the 
investment, for example). 
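
One way to picture these weighting rules is a function that assigns each point estimate a weight before it is passed to the smoothing algorithm. The following Python sketch is hypothetical: the `Feedback` record, the halving of passive feedback, the 90-day half-life, and the session multiplier are all illustrative assumptions, since the specification gives no numeric values.

```python
from dataclasses import dataclass


@dataclass
class Feedback:
    estimate: float      # point estimate r - q of topical interest
    active: bool         # True for active feedback, False for passive
    age_days: float      # time elapsed since the feedback was given
    same_session: bool   # True if given during the current session


def feedback_weight(fb, passive_factor=0.5, half_life_days=90.0,
                    session_boost=2.0):
    """Weight of one point estimate as input to the smoothing algorithm."""
    weight = 1.0 if fb.active else passive_factor     # passive is less reliable
    weight *= 0.5 ** (fb.age_days / half_life_days)   # recent feedback counts more
    if fb.same_session:
        weight *= session_boost                       # current mood counts more
    return weight


print(feedback_weight(Feedback(3.0, active=False, age_days=90, same_session=False)))
```

Setting `passive_factor` to 0 corresponds to the option of not using passive feedback at all.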

To effectively apply the smoothing technique, it is nec- 
essary to have a definition of the similarity distance between 
(U, X) and (V, Y), for any users U and V and any target 
objects X and Y. We have already seen how to define the
distance d(X, Y) between two target objects X and Y, given
their attributes. We may regard a pair such as (U, X) as an
extended object that bears all the attributes of target X and
all the attributes of user U; then the distance between (U, X) 
and (V, Y) may be computed in exactly the same way. This 
approach requires user U, user V, and all other users to have 
some attributes of their own stored in the system: for 
example, age (numeric), social security number (textual), 
and list of documents previously retrieved (associative). It is 
these attributes that determine the notion of "similar users."
Thus it is desirable to generate profiles of users (termed
"user profiles") as well as profiles of target objects (termed
"target profiles"). Some attributes employed for profiling 
users may be related to the attributes employed for profiling 
target objects: for example, using associative attributes, it is 
possible to characterize target objects such as X by the 
interest that various users have shown in them, and simul- 
taneously to characterize users such as U by the interest that 
they have shown in various target objects. In addition, user 
profiles may make use of any attributes that are useful in
characterizing humans, such as those suggested in the 
example domain above where target objects are potential 
consumers. Notice that user U's interest can be estimated 
even if user U is a new user or an off-line user who has never 
provided any feedback, because the relevance feedback of 
users whose attributes are similar to U's attributes is taken
into account. 
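
The "extended object" idea can be sketched as follows. This is a minimal illustration that assumes all attributes have already been decomposed into numbers and that distance is a weighted Euclidean distance; the specification's own distance definition appears in an earlier section and may differ. The `user.` and `target.` prefixes and the example attributes are hypothetical.

```python
import math


def extended_profile(user_attrs, target_attrs):
    """Profile of the pair (U, X): all attributes of U plus all attributes of X."""
    merged = {f"user.{k}": v for k, v in user_attrs.items()}
    merged.update({f"target.{k}": v for k, v in target_attrs.items()})
    return merged


def distance(p, q, weights=None):
    """Weighted Euclidean distance between two numeric profiles (assumption)."""
    weights = weights or {}
    keys = set(p) | set(q)
    return math.sqrt(sum(
        weights.get(k, 1.0) * (p.get(k, 0.0) - q.get(k, 0.0)) ** 2
        for k in keys))


ux = extended_profile({"age": 30}, {"price": 10})
vy = extended_profile({"age": 33}, {"price": 14})
print(distance(ux, vy))
```

Because user and target attributes share one weight table, a single set of attribute weights controls how far topical interest generalizes across both users and target objects.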

For some uses of filtering systems, when estimating 
topical interest, it is appropriate to make an additional 
"presumption of no topical interest" (or "bias toward zero"). 
To understand the usefulness of such a presumption, suppose 
the system needs to determine whether target object X is 
topically interesting to the user U, but that users like user U 
have never provided feedback on target objects even 
remotely like target object X. The presumption of no topical 
interest says that if this is so, it is because users like user U 
are simply not interested in such target objects and therefore 
do not seek them out and interact with them. On this 
presumption, the system should estimate topical interest f(U,
X) to be low. Formally, this example has the characteristic
that (U, X) is far away from all the points (V, Y) where
feedback is available. In such a case, topical interest f(U, X)
is presumed to be close to zero, even if the value of the
topical interest function f(*, *) is high at all the faraway
surrounding points at which its value is known. When a
smoothing technique is used, such a presumption of no
topical interest can be introduced, if appropriate, by
manipulating the input to the smoothing technique. In addition
to using observed values of the topical interest function f(*, *)
as input, the trick is to also introduce fake observations of
the form topical interest f(V, Y)=0 for a lattice of points (V,
Y) distributed throughout the multidimensional space. These
fake observations should be given relatively low weight as
inputs to the smoothing algorithm. The more strongly they
are weighted, the stronger the presumption of no interest.

The following provides another simple example of an
estimation technique that has a presumption of no interest.
Let g be a decreasing function from non-negative real
numbers to non-negative real numbers, such as g(x)=e^{-x} or
g(x)=min(1, x^{-k}) where k>1. Estimate topical interest f(U,
X) with the following g-weighted average:

$$f(U, X) = \frac{\sum \bigl(r(V, Y) - q(V, Y)\bigr)\, g\bigl(\mathrm{distance}((U, X), (V, Y))\bigr)}{[\text{denominator illegible in source}]}$$

Here the summations are over all pairs (V, Y) such that
user V has provided feedback r(V, Y) on target object Y, i.e.,
all pairs (V, Y) such that relevance feedback r(V, Y) is
defined. Note that both with this technique and with
conventional smoothing techniques, the estimate of the topical
interest f(U, X) is not necessarily equal to r(U, X)-q(U, X),
even when r(U, X) is defined.
Filtering: Adjusting Weights and Residue Feedback

The method described above requires the filtering system
to measure distances between (user, target object) pairs, such
as the distance between (U, X) and (V, Y). Given the means
described earlier for measuring the distance between two
multi-attribute profiles, the method must therefore associate
a weight with each attribute used in the profile of (user,
target object) pairs, that is, with each attribute used to profile
either users or target objects. These weights specify the
relative importance of the attributes in establishing
similarity or difference, and therefore, in determining how
topical interest is generalized from one (user, target object)
pair to another. Additional weights determine which
attributes of a target object contribute to the quality function
q, and by how much.

It is possible and often desirable for a filtering system to
store a different set of weights for each user. For example,
a user who thinks of two-star films as having materially
different topic and style from four-star films wants to assign
a high weight to "number of stars" for purposes of the
similarity distance measure d(*, *); this means that interest
in a two-star film does not necessarily signal interest in an
otherwise similar four-star film, or vice-versa. If the user
also agrees with the critics, and actually prefers four-star
films, the user also wants to assign "number of stars" a high
positive weight in the determination of the quality function
q. In the same way, a user who dislikes vulgarity wants to
assign the "vulgarity score" attribute a high negative weight
in the determination of the quality function q, although the
"vulgarity score" attribute does not necessarily have a high
weight in determining the topical similarity of two films.

Attribute weights (of both sorts) may be set or adjusted by
the system administrator or the individual user, on either a
temporary or a permanent basis. However, it is often
desirable for the filtering system to learn attribute weights
automatically, based on relevance feedback. The optimal
attribute weights for a user U are those that allow the most
accurate prediction of user U's interests. That is, with the
distance measure and quality function defined by these
attribute weights, user U's interest in target object X, q(U,
X)+f(U, X), can be accurately estimated by the techniques
above. The effectiveness of a particular set of attribute
weights for user U can therefore be gauged by seeing how
well it predicts user U's known interests.

Formally, suppose that user U has previously provided
feedback on target objects X_1, X_2, X_3, . . . X_n, and that the
feedback ratings are r(U, X_1), r(U, X_2), r(U, X_3), . . . r(U,
X_n). Values of feedback ratings r(*, *) for other users and
other target objects may also be known. The system may use
the following procedure to gauge the effectiveness of the set
of attribute weights it currently stores for user U: (i) For
each 1<=i<=n, use the estimation techniques to estimate
q(U, X_i)+f(U, X_i) from all known values of feedback ratings
r. Call this estimate a_i. (ii) Repeat step (i), but this time make
the estimate for each 1<=i<=n without using the feedback
ratings r(U, X_j) as input, for any j such that the distance d(X_i,
X_j) is smaller than a fixed threshold. That is, estimate each
q(U, X_i)+f(U, X_i) from the remaining feedback ratings
only; in particular, do not use r(U, X_i) itself. Call this
estimate b_i. The difference a_i-b_i is herein termed the "residue
feedback r_res(U, X_i) of user U on target object X_i." (iii)
Compute user U's error measure, (a_1-b_1)^2+(a_2-b_2)^2+(a_3-
b_3)^2+ . . . +(a_n-b_n)^2.

A gradient-descent or other numerical optimization
method may be used to adjust user U's attribute weights so
that this error measure reaches a (local) minimum. This
approach tends to work best if the smoothing technique used
in estimation is such that the value of f(V, Y) is strongly
affected by the point estimate r(V, Y)-q(V, Y) when the latter
value is provided as input. Otherwise, the presence or
absence of the single input feedback rating r(U, X_i) in steps
(i)-(ii) may not make a_i and b_i very different from each
other. A slight variation of this learning technique adjusts a
single global set of attribute weights for all users, by
adjusting the weights so as to minimize not a particular
user's error measure but rather the total error measure of all
users. These global weights are used as a default initial
setting for a new user who has not yet provided any
feedback. Gradient descent can then be employed to adjust
this user's individual weights over time.

Even when the attribute weights are chosen to minimize
the error measure for user U, the error measure is generally
still positive, meaning that residue feedback from user U has
not been reduced to 0 on all target objects. It is useful to note
that high residue feedback from a user U on a target object
X indicates that user U liked target object X unexpectedly
well given its profile, that is, better than the smoothing
model could predict from user U's opinions on target objects
with similar profiles. Similarly, low residue feedback
indicates that user U liked target object X less than was
expected. By definition, this unexplained preference or
dispreference cannot be the result of topical similarity, and
therefore must be regarded as an indication of the intrinsic
quality of target object X. It follows that a useful quality
attribute for a target object X is the average amount of
residue feedback r_res(V, X) from users on that target object,
averaged over all users V who have provided relevance
feedback on the target object. In a variation of this idea,
residue feedback is never averaged indiscriminately over all
users to form a new attribute, but instead is smoothed to
consider users' similarity to each other. Recall that the
quality measure q(U, X) depends on the user U as well as the
target object X, so that a given target object X may be
perceived by different users to have different quality. In this
variation, as before, q(U, X) is calculated as a weighted sum
of various quality attributes that are dependent only on X,
but then an additional term is added, namely an estimate of
r_res(U, X) found by applying a smoothing algorithm to
known values of r_res(V, X). Here V ranges over all users
who have provided relevance feedback on target object X,
and the smoothing algorithm is sensitive to the distances
d(U, V) from each such user V to user U.
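
The g-weighted estimate and the residue-feedback procedure of steps (i)-(iii) can be sketched together in Python. This is a sketch under stated assumptions, not the specification's exact method: the denominator of the g-weighted average is not legible in this copy, so the code divides by max(1, sum of weights), a choice made here only because it yields an estimate near zero when all feedback is far away. Points are plain numbers, all observations belong to one user, and the function names are hypothetical.

```python
import math


def g(x):
    """Decreasing weight function; the text suggests e^-x as one option."""
    return math.exp(-x)


def estimate_f(point, observations, dist, skip=lambda other: False):
    """g-weighted estimate of topical interest at `point`.

    observations: list of (point, f_estimate), with f_estimate = r - q.
    """
    num = den = 0.0
    for other, f_est in observations:
        if skip(other):
            continue
        w = g(dist(point, other))
        num += f_est * w
        den += w
    # ASSUMPTION: dividing by max(1, sum of weights) pulls the estimate
    # toward zero when every observation is far away.
    return num / max(1.0, den)


def residues(observations, dist, threshold, q=lambda point: 0.0):
    """Residue feedback a_i - b_i for every rated point (steps i and ii)."""
    out = []
    for x_i, _ in observations:
        a_i = q(x_i) + estimate_f(x_i, observations, dist)
        b_i = q(x_i) + estimate_f(
            x_i, observations, dist,
            skip=lambda x_j: dist(x_i, x_j) < threshold)
        out.append(a_i - b_i)
    return out


def error_measure(res):
    """Step (iii): sum of squared residues."""
    return sum(r * r for r in res)


def dist(a, b):
    return abs(a - b)


obs = [(0.0, 4.0), (0.1, 4.0), (5.0, -2.0)]   # hypothetical point estimates
res = residues(obs, dist, threshold=0.05)
print(res, error_measure(res))
```

A weight-learning loop would recompute `error_measure` after each adjustment of the attribute weights that define `dist` and `q`, and keep adjustments that lower it.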
Using the Similarity Computation for Clustering 

A method for defining the distance between any pair of 
target objects was disclosed above. Given this distance 
measure, it is simple to apply a standard clustering 
algorithm, such as k-means, to group the target objects into 
a number of clusters, in such a way that similar target objects 
tend to be grouped in the same cluster. It is clear that the 
resulting clusters can be used to improve the efficiency of 
matching buyers and sellers in the application described in
section "Matching Buyers and Sellers" above: it is not
necessary to compare every buy profile to every sell profile,
but only to compare buy profiles and sell profiles that are
similar enough to appear in the same cluster. As explained
below, the results of the clustering procedure can also be
used to make filtering more efficient, and in the service of
querying and browsing tasks.

The k-means clustering method is familiar to those skilled
in the art. Briefly put, it finds a grouping of points (target
profiles, in this case, whose numeric coordinates are given
by numeric decomposition of their attributes as described
above) to minimize the distance between points in the
clusters and the centers of the clusters in which they are
located. This is done by alternating between assigning each
point to the cluster which has the nearest center and then,
once the points have been assigned, computing the (new)
center of each cluster by averaging the coordinates of the
points (target profiles) located in this cluster. Other
clustering methods can be used, such as "soft" or "fuzzy"
k-means clustering, in which objects are allowed to belong to
more than one cluster. This can be cast as a clustering problem
similar to the k-means problem, but now the criterion being
optimized is a little different:

$$\sum_{C} \sum_{i} i_{iC} \, d(x_i, \bar{x}_C)$$

where C ranges over cluster numbers, i ranges over target
objects, $x_i$ is the numeric vector corresponding to the profile
of target object number i, $\bar{x}_C$ is the mean of all the numeric
vectors corresponding to target profiles of target objects in
cluster number C, termed the "cluster profile" of cluster C,
d(*, *) is the metric used to measure distance between two
target profiles, and $i_{iC}$ is a value between 0 and 1 that
indicates how much target object number i is associated with
cluster number C, where i is an indicator matrix with the
property that for each i, $\sum_{C} i_{iC} = 1$. For
k-means clustering, $i_{iC}$ is either 0 or 1.
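
The alternation between assignment and re-centering described above can be sketched in plain Python. This sketch assumes squared Euclidean distance, caller-supplied initial centers, and a fixed number of iterations; none of these choices is prescribed by the specification, and the function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two numeric profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise average, i.e. the cluster profile."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def assign(points, centers):
    """Put each point in the cluster whose center is nearest."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)),
                      key=lambda c: sq_dist(p, centers[c]))
        clusters[nearest].append(p)
    return clusters


def kmeans(points, centers, iterations=10):
    """Alternate assignment and re-centering for a fixed number of rounds."""
    for _ in range(iterations):
        clusters = assign(points, centers)
        # An empty cluster keeps its previous center (assumption).
        centers = [mean(c) if c else old
                   for c, old in zip(clusters, centers)]
    return centers, assign(points, centers)


points = [(0, 0), (0, 2), (10, 0), (10, 2)]
centers, clusters = kmeans(points, centers=[(0, 0), (10, 0)])
print(centers, clusters)
```

In the hard k-means case shown here each point belongs to exactly one cluster; a soft variant would replace `assign` with fractional memberships that sum to 1 for each point.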

Any of these basic types of clustering might be used by 
the system: 

1) Association-based clustering, in which profiles contain
only associative attributes, and thus distance is defined
entirely by associations. This kind of clustering
generally (a) clusters target objects based on the similarity of
the users who like them or (b) clusters users based on
the similarity of the target objects they like. In this
approach, the system does not need any information
about target objects or users, except for their history of
interaction with each other.

2) Content-based clustering, in which profiles contain
only non-associative attributes. This kind of clustering
(a) clusters target objects based on the similarity of
their non-associative attributes (such as word
frequencies) or (b) clusters users based on the
similarity of their non-associative attributes (such as
demographics and psychographics). In this approach, the
system does not need to record any information about
users' historical patterns of information access, but it
does need information about the intrinsic properties of
users and/or target objects.

3) Uniform hybrid method, in which profiles may contain
both associative and non-associative attributes. This
method combines 1a and 2a, or 1b and 2b. The distance
d(P_X, P_Y) between two profiles P_X and P_Y may be
computed by the general similarity-measurement
methods described earlier.

4) Sequential hybrid method. First apply the k-means
procedure to do 1a, so that articles are labeled by
cluster based on which user read them, then use
supervised clustering (maximum likelihood discriminant
methods) using the word frequencies to do the process
of method 2a described above. This tries to use
knowledge of who read what to do a better job of clustering
based on word frequencies. One could similarly
combine the methods 1b and 2b described above.

Hierarchical clustering of target objects is often useful. 
Hierarchical clustering produces a tree which divides the 
target objects first into two large clusters of roughly similar 
objects; each of these clusters is in turn divided into two or 
more smaller clusters, which in turn are each divided into yet 
smaller clusters until the collection of target objects has been 
entirely divided into "clusters" consisting of a single object 
each, as diagrammed in FIG. 8. In this diagram, the node d
denotes a particular target object d, or equivalently, a single- 
member cluster consisting of this target object. Target object 
d is a member of the cluster (a, b, d), which is a subset of 
the cluster (a, b, c, d, e, f), which in turn is a subset of all 
target objects. The tree shown in FIG. 8 would be produced 
from a set of target objects such as those shown
geometrically in FIG. 7. In FIG. 7, each letter represents a target
object, and axes xl and x2 represent two of the many 
numeric attributes on which the target objects differ. Such a 
cluster tree may be created by hand, using human judgment 
to form clusters and subclusters of similar objects, or may be 
created automatically in either of two standard ways: top- 
down or bottom-up. In top-down hierarchical clustering, the 
set of all target objects in FIG. 7 would be divided into the 
clusters (a, b, c, d, e, f) and (g, h, i, j, k). The clustering
algorithm would then be reapplied to the target objects in 
each cluster, so that the cluster (g, h, i, j, k) is subpartitioned 
into the clusters (g, k) and (h, i, j), and so on to arrive at the 
tree shown in FIG. 8. In bottom-up hierarchical clustering, 
the set of all target objects in FIG. 7 would be grouped into 
numerous small clusters, namely (a, b), d, (c, f), e, (g, k), (h,
i), and j. These clusters would then themselves be grouped 
into the larger clusters (a, b, d), (c, e, f), (g, k), and (h, i, j), 
according to their cluster profiles. These larger clusters 
would themselves be grouped into (a, b, c, d, e, f) and (g, k,
h, i, j), and so on until all target objects had been grouped 
together, resulting in the tree of FIG. 8. Note that for 
bottom-up clustering to work, it must be possible to apply 
the clustering algorithm to a set of existing clusters. This 
requires a notion of the distance between two clusters. The 
method disclosed above for measuring the distance between 
target objects can be applied directly, provided that clusters 
are profiled in the same way as target objects. It is only
necessary to adopt the convention that a cluster's profile is
the average of the target profiles of all the target objects in
the cluster; that is, to determine the cluster's value for a
given attribute, take the mean value of that attribute across
all the target objects in the cluster. For the mean value to be
well-defined, all attributes must be numeric, so it is
necessary as usual to replace each textual or associative attribute
with its decomposition into numeric attributes (scores), as
described earlier. For example, the target profile of a single
Woody Allen film would assign "Woody-Allen" a score of 1
in the "name-of-director" field, while giving
"Federico-Fellini" and "Terence-Davies" scores of 0. A cluster
that consisted of 20 films directed by Allen and 5 directed by
Fellini would be profiled with scores of 0.8, 0.2, and 0
respectively, because, for example, 0.8 is the average of 20
ones and 5 zeros.
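
The averaging convention can be checked with a short Python sketch. It assumes each target profile is a dictionary of numeric scores, with a missing key treated as a score of 0; the helper names `cluster_profile` and `film` are hypothetical.

```python
def cluster_profile(profiles):
    """Cluster profile: mean of each numeric attribute over all members."""
    keys = set().union(*profiles)
    n = len(profiles)
    # A member that lacks an attribute is treated as scoring 0 (assumption).
    return {k: sum(p.get(k, 0.0) for p in profiles) / n for k in keys}


def film(director):
    """Numeric decomposition of the "name-of-director" field for one film."""
    names = ["Woody-Allen", "Federico-Fellini", "Terence-Davies"]
    return {name: (1 if name == director else 0) for name in names}


films = [film("Woody-Allen")] * 20 + [film("Federico-Fellini")] * 5
print(cluster_profile(films))
```

Because a cluster profile has the same shape as a target profile, the same distance function can compare two clusters, which is what bottom-up clustering requires.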
Searching for Target Objects 

Given a target object with target profile P, or alternatively
given a search profile P, a hierarchical cluster tree of target
objects makes it possible for the system to search efficiently
for target objects with target profiles similar to P. It is only
necessary to navigate through the tree, automatically, in
search of such target profiles. The system for customized
electronic identification of desirable objects begins by
considering the largest, top-level clusters, and selects the cluster
whose profile is most similar to target profile P. In the event
of a near-tie, multiple clusters may be selected. Next, the
system considers all subclusters of the selected clusters, and
this time selects the subcluster or subclusters whose profiles
are closest to target profile P. This refinement process is
iterated until the clusters selected on a given step are
sufficiently small, and these are the desired clusters of target
objects with profiles most similar to target profile P. Any
hierarchical cluster tree therefore serves as a decision tree
for identifying target objects. In pseudo-code form, this
process is as follows (and in flow diagram form in FIGS.
13A and 13B):

1. Initialize list of identified target objects to the empty list 
at step 13A00 

2. Initialize the current tree T to be the hierarchical cluster 
tree of all objects at step 13A01 and at step 13A02 scan 
the current cluster tree for target objects similar to P 
using the process detailed in FIG. 13B. At step 13A03, 
the list of target objects is returned. 

3. At step 13B00, the variable i is set to 1 and, for each
child subtree Ti of the root of tree T, the cluster
profile p_i of Ti is retrieved.

4. At step 13B02, calculate d(P, p_i), the similarity distance
between P and p_i.

5. At step 13B03, if d(P, p_i)<t, a threshold, branch to one
of two options.

6. If tree Ti contains only one target object at step 13B04,
add that target object to list of identified target objects
at step 13B05 and advance to step 13B07.

7. If tree Ti contains multiple target objects at step 13B04,
scan the ith child subtree for target objects similar to P
by invoking the steps of the process of FIG. 13B
recursively and then recurse to step 3 (step 13A01 in
FIG. 13A) with T bound for the duration of the
recursion to tree Ti, in order to search in tree Ti for target
objects with profiles similar to P.

In step 5 of this pseudo-code, smaller thresholds are
typically used at lower levels of the tree, for example by
making the threshold an affine function or other function of
the cluster variance or cluster diameter of the cluster pi. If
the cluster tree is distributed across a plurality of servers, as
described in the section of this description titled "Network
Context of the Browsing System", this process may be
executed in distributed fashion as follows: steps 3-7 are
executed by the server that stores the root node of hierarchical
cluster tree T, and the recursion in step 7 to a
subcluster tree Ti involves the transmission of a search
request to the server that stores the root node of tree Ti,
which server carries out the recursive step upon receipt of
this request. Steps 1-2 are carried out by the processor that
initiates the search, and the server that executes step 6 must
send a message identifying the target object to this initiating
processor, which adds it to the list.
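
The numbered steps above can be sketched in Python as follows. This is
a minimal, single-machine illustration and not the disclosed
implementation: the ClusterNode structure, the use of Euclidean
distance for d, and the base and slope parameters of the affine
threshold are all assumptions introduced for the example.

```python
from dataclasses import dataclass, field
from math import dist
from typing import List, Optional, Sequence


@dataclass
class ClusterNode:
    """Hypothetical node of a hierarchical cluster tree."""
    profile: Sequence[float]                 # cluster profile pi (assumed numeric)
    target: Optional[str] = None             # set only when the node holds one target object
    children: List["ClusterNode"] = field(default_factory=list)
    diameter: float = 0.0                    # cluster diameter, used to scale the threshold


def threshold(node: ClusterNode, base: float = 1.0, slope: float = 1.0) -> float:
    # Affine function of the cluster diameter, so that small (low-level)
    # clusters are tested against smaller thresholds. base/slope are assumed.
    return base + slope * node.diameter


def scan(tree: ClusterNode, p: Sequence[float], found: List[str]) -> None:
    """Steps 3-7: examine each child subtree Ti of the root of `tree`."""
    for child in tree.children:
        if dist(p, child.profile) < threshold(child):   # steps 4 and 5
            if child.target is not None:                 # step 6: single target object
                found.append(child.target)
            else:                                        # step 7: recurse with T bound to Ti
                scan(child, p, found)


def find_similar(root: ClusterNode, p: Sequence[float]) -> List[str]:
    """Steps 1-2: start with an empty list, scan the whole tree, return the list."""
    found: List[str] = []
    scan(root, p, found)
    return found


# Small illustrative tree: two clusters, three target objects.
root = ClusterNode(profile=(5.0, 5.0), diameter=14.0, children=[
    ClusterNode(profile=(0.0, 0.0), diameter=1.0, children=[
        ClusterNode(profile=(0.0, 0.2), target="a1"),
        ClusterNode(profile=(0.3, 0.0), target="a2"),
    ]),
    ClusterNode(profile=(10.0, 10.0), diameter=1.0, children=[
        ClusterNode(profile=(10.0, 10.1), target="b1"),
    ]),
])
```

With this tree, find_similar(root, (0.1, 0.1)) visits only the first
cluster and returns its two target objects. In the distributed variant
described above, the recursive call in step 7 would be replaced by a
search request sent to the server holding subtree Ti.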

Assuming that low-level clusters have already been
formed through clustering, there are alternative search methods
for identifying the low-level cluster whose profile is
most similar to a given target profile P. A standard
back-propagation neural net is one such method: it should be
trained to take the attributes of a target object as input, and
produce as output a unique pattern that can be used to
identify the appropriate low-level cluster. For maximum
accuracy, low-level clusters that are similar to each other
(close together in the cluster tree) should be given similar
identifying patterns. Another approach is a standard decision
tree that considers the attributes of target profile P one at a
time until it can identify the appropriate cluster. If profiles
are large, this may be more rapid than considering all
attributes. A hybrid approach to searching uses distance
measurements as described above to navigate through the
top few levels of the hierarchical cluster tree, until it reaches
a cluster of intermediate size whose profile is similar to
target profile P, and then continues by using a decision tree
specialized to search for low-level subclusters of that
intermediate cluster.

One use of these searching techniques is to search for 
target objects that match a search profile from a user's search 
profile set. This form of searching is used repeatedly in the 
news clipping service, active navigation, and Virtual Com- 
munity Service applications, described below. Another use is 
to add a new target object quickly to the cluster tree. An 
existing cluster that is similar to the new target object can be 
located rapidly, and the new target object can be added to 
this cluster. If the object is beyond a certain threshold 
distance from the cluster center, then it is advisable to start 
a new cluster. Several variants of this incremental clustering 
scheme can be used, and can be built using variants of 
subroutines available in advanced statistical packages. Note 
that various methods can be used to locate the new target
objects that must be added to the cluster tree, depending on
the architecture used. In one method, a "webcrawler" program
running on a central computer periodically scans all
servers in search of new target objects, calculates the target 
profiles of these objects, and adds them to the hierarchical 
cluster tree by the above method. In another, whenever a 
new target object is added to any of the servers, a software 
"agent" at that server calculates the target profile and adds 
it to the hierarchical cluster tree by the above method. 
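
The incremental clustering step described above (add the new target
object to the most similar existing cluster, or start a new cluster
when it lies beyond a threshold distance from the cluster center) can
be sketched as follows. This is a simplified illustration only: it
assumes numeric profiles, Euclidean distance, and a flat dictionary of
clusters searched linearly in place of the cluster tree search; the
function name add_object and the parameter max_distance are
hypothetical.

```python
from math import dist
from typing import Dict, List, Tuple

Profile = Tuple[float, ...]


def centroid(members: List[Profile]) -> Profile:
    # Component-wise mean of the member profiles (the "cluster center").
    return tuple(sum(column) / len(members) for column in zip(*members))


def add_object(clusters: Dict[int, List[Profile]], obj: Profile,
               max_distance: float) -> int:
    """Add obj to the nearest cluster, or to a new one; return the cluster id."""
    best_id, best_d = None, float("inf")
    for cid, members in clusters.items():
        d = dist(obj, centroid(members))
        if d < best_d:
            best_id, best_d = cid, d
    # Beyond the threshold distance (or no clusters yet): start a new cluster.
    if best_id is None or best_d > max_distance:
        best_id = max(clusters, default=-1) + 1
        clusters[best_id] = []
    clusters[best_id].append(obj)
    return best_id
```

A webcrawler or a per-server agent, as described above, would call
such a routine once for each newly discovered target object after
computing its target profile.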
Rapid Profiling 

In some domains, complete profiles of target objects are
not always easy to construct automatically. When target
objects are multi-media games, for example, an attribute such as
genre (a single textual term such as "action",
"suspense/thriller", "word games", etc.) may be a matter of judgment
and opinion. More significantly, if each title has an associated
attribute that records the positive or negative relevance
feedback to that title from various human users (consumers),
then all the association scores of any newly introduced titles
are initially zero, so that it is initially unclear what other
titles are similar to the new title with respect to the users who
like them. Indeed, if this associative attribute is highly
weighted, the initial lack of relevance feedback information
may be difficult to remedy, due to a vicious circle in which
users of moderate-to-high interest are needed to provide
relevance feedback but relevance feedback is needed to
identify users of moderate-to-high interest.

Fortunately, however, it is often possible in principle to
determine certain attributes of a new target object by
extraordinary methods, including but not limited to methods
that consult a human. For example, the system can in
principle determine the genre of a title by consulting one or
more randomly chosen individuals from a set of known
human experts, while to determine the numeric association
score between a new title and a particular user, it can in
principle show the title to that user and obtain relevance
feedback. Since such requests inconvenience people,
however, it is important not to determine all difficult
attributes this way, but only the ones that are most important
in classifying the article. "Rapid profiling" is a method for
selecting those numeric attributes that are most important to
determine. (Recall that all attributes can be decomposed into
numeric attributes, such as association scores or term
scores.) First, a set of existing target objects that already
have complete or largely complete profiles are clustered
using a k-means algorithm. Next, each of the resulting
clusters is assigned a unique identifying number, and each
clustered target object is labeled with the identifying number
of its cluster. Standard methods then allow construction of a
single decision tree that can determine any target object's
cluster number, with substantial accuracy, by considering
the attributes of the target object, one at a time. Only
attributes that can if necessary be determined for any new
target object are used in the construction of this decision
tree. To profile a new target object, the decision tree is
traversed downward from its root as far as is desired. The
root of the decision tree considers some attribute of the
target object. If the value of this attribute is not yet known,
it is determined by a method appropriate to that attribute; for
example, if the attribute is the association score of the target
object with user #4589, then relevance feedback (to be used
as the value of this attribute) is solicited from user #4589,
perhaps by the ruse of adding the possibly uninteresting
target object to a set of objects that the system recommends
to the user's attention, in order to find out what the user
thinks of it. Once the root attribute is determined, the rapid
profiling method descends the decision tree by one level,
choosing one of the decision subtrees of the root in accordance
with the determined value of the root attribute. The
root of this chosen subtree considers another attribute of the
target object, whose value is likewise determined by an
appropriate method. The process can be repeated to determine
as many attributes as desired, by whatever methods are
available, although it is ordinarily stopped after a small
number of attributes, to avoid the burden of determining too
many attributes.

It should be noted that the rapid profiling method can be
used to identify important attributes in any sort of profile,
and not just profiles of target objects. In particular, recall that
the disclosed method for determining topical interest
through similarity requires users as well as target objects to
have profiles. New users, like new target objects, may be
profiled or partially profiled through the rapid profiling
process. For example, when user profiles include an associative
attribute that records the user's relevance feedback
on all target objects in the system, the rapid profiling
procedure can rapidly form a rough characterization of a
new user's interests by soliciting the user's feedback on a
small number of significant target objects, and perhaps also
by determining a small number of other key attributes of the
new user, by on-line queries, telephone surveys, or other
means. Once the new user has been partially profiled in this
way, the methods disclosed above predict that the new user's
interests resemble the known interests of other users with
similar profiles. In a variation, each user's user profile is
subdivided into a set of long-term attributes, such as
demographic characteristics, and a set of short-term attributes that
help to identify the user's temporary desires and emotional
state, such as the user's textual or multiple-choice answers
to questions whose answers reflect the user's mood. A subset
of the user's long-term attributes are determined when the
user first registers with the system, through the use of a rapid
profiling tree of long-term attributes. In addition, each time
the user logs on to the system, a subset of the user's
short-term attributes are additionally determined, through
the use of a separate rapid profiling tree that asks about
short-term attributes.

Market Research

A technique similar to rapid profiling is of interest in
market research (or voter research). Suppose that the target
objects are consumers. A particular attribute in each target
profile indicates whether the consumer described by that
target profile has purchased product X. A decision tree can
be built that attempts to determine what value a consumer
has for this attribute, by consideration of the other attributes
in the consumer's profile. This decision tree may be traversed
to determine whether additional users are likely to
purchase product X. More generally, the top few levels of
the decision tree provide information, valuable to advertisers
who are planning mass-market or direct-mail campaigns,
about the most significant characteristics of consumers of
product X.

Similar information can alternatively be extracted from a
collection of consumer profiles without recourse to a decision
tree, by considering attributes one at a time, and
identifying those attributes on which product X's consumers
differ significantly from its non-consumers. These techniques
serve to characterize consumers of a particular product;
they can be equally well applied to voter research or
other survey research, where the objective is to characterize
those individuals from a given set of surveyed individuals
who favor a particular candidate, hold a particular opinion,
belong to a particular demographic group, or have some
other set of distinguishing attributes. Researchers may wish
to purchase batches of analyzed or unanalyzed user profiles
from which personal identifying information has been
removed. As with any statistical database, statistical conclusions
can be drawn, and relationships between attributes can
be elucidated using knowledge discovery techniques which
are well known in the art.

SUPPORTING ARCHITECTURE

The following section describes the preferred computer
and network architecture for implementing the methods
described in this patent.

Electronic Media System Architecture

FIG. 1 illustrates in block diagram form the overall
architecture of an electronic media system, known in the art,
in which the system for customized electronic identification
of desirable objects of the present invention can be used to
provide user customized access to target objects that are
available via the electronic media system. In particular, the
electronic media system comprises a data communication
facility that interconnects a plurality of users with a number
of information servers. The users are typically individuals,
whose personal computers (terminals) T1-Tn are connected
via a data communications link, such as a modem and a
telephone connection established in well-known fashion, to
a telecommunication network N. User information access
software is resident on the user's personal computer and
serves to communicate over the data communications link
and the telecommunication network N with one of the
plurality of network vendors V1-Vk (America Online,
Prodigy, CompuServe, other private companies or even
universities) who provide data interconnection service with
selected ones of the information servers I1-Im. The user can,
by means of the user information access software, interact with
the information servers I1-Im to request and obtain access to
data that resides on mass storage systems SS1-SSm that are part
of the information server apparatus. New data is input to this
system by users via their personal computers T1-Tn and by
commercial information services by populating their mass
storage systems SS1-SSm with commercial data. Each user
terminal T1-Tn and the information servers I1-Im have
phone numbers or IP addresses on the network N which
enable a data communication link to be established between
a particular user terminal T1-Tn and the selected information
server I1-Im. A user's electronic mail address also uniquely
identifies the user and the user's network vendor V1-Vk in an
industry-standard format such as: username@aol.com or
username@netcom.com. The network vendors V1-Vk provide
access passwords for their subscribers (selected users),
through which the users can access the information servers
I1-Im. The subscribers pay the network vendors V1-Vk for
the access services on a fee schedule that typically includes
a monthly subscription fee and usage based charges.

A difficulty with this system is that there are numerous
information servers I1-Im located around the world, each of
which provides access to a set of information of differing
format, content and topics and via a cataloging system that
is typically unique to the particular information server I1-Im.
The information is comprised of individual "files," which
can contain audio data, video data, graphics data, text data,
structured database data and combinations thereof. In the
terminology of this patent, each target object is associated
with a unique file: for target objects that are informational in
nature and can be digitally represented, the file directly
stores the informational content of the target object, while
for target objects that are not stored electronically, such as
purchasable goods, the file contains an identifying description
of the target object. Target objects stored electronically
as text files can include commercially provided news
articles, published documents, letters, user-generated
documents, descriptions of physical objects, or combinations
of these classes of data. The organization of the files
containing the information and the native format of the data
contained in files of the same conceptual type may vary by
information server.

Thus, a user can have difficulty in locating files that
contain the desired information, because the information
may be contained in files whose information server cataloging
may not enable the user to locate them. Furthermore,
there is no standard catalog that defines the presence and
services provided by all information servers I1-Im. A user
therefore does not have simple access to information but
must expend a significant amount of time and energy to
excerpt a segment of the information that may be relevant to
the user from the plethora of information that is generated
and populated on this system. Even if the user commits the
necessary resources to this task, existing information
retrieval processes lack the accuracy and efficiency to ensure
that the user obtains the desired information. It is obvious
that within the constructs of this electronic media system,
the three modules of the system for customized electronic
identification of desirable objects can be implemented in a
distributed manner, even with various modules being implemented
on and/or by different vendors within the electronic
media system. For example, the information servers I1-Im
can include the target profile generation module while the
network vendors V1-Vk may implement the user profile
generation module, the target profile interest summary
generation module, and/or the profile processing module. A
module can itself be implemented in a distributed manner,
with numerous nodes being present in the network N, each
node serving a population of users in a particular geographic
area. The totality of these nodes comprises the functionality
of the particular module. Various other partitions of the
modules and their functions are possible and the examples
provided herein represent illustrative examples and are not
intended to limit the scope of the claimed invention. For the
purposes of pseudonymous creation and update of users'
target profile interest summaries (as described below), the
vendors V1-Vk may be augmented with some number of
proxy servers, which provide a mechanism for ongoing
pseudonymous access and profile building through the
method described herein. At least one trusted validation
server must be in place to administer the creation of
pseudonyms in the system.

An important characteristic of this system for customized
electronic identification of desirable objects is its
responsiveness, since the intended use of the system is in an
interactive mode. The system utility grows with the number
of the users and this increases the number of possible
consumer/product relationships between users and target
objects. A system that serves a large group of users must
maintain interactive performance and the disclosed method
for profiling and clustering target objects and users can in
turn be used for optimizing the distribution of data among
the members of a virtual community and through a data
communications network, based on users' target profile
interest summaries.

Network Elements and System Characteristics

The various processors interconnected by the data
communication network N as shown in FIG. 1 can be divided
into two classes and grouped as illustrated in FIG. 2: clients
and servers. The clients C1-Cn are individual users' computer
systems which are connected to servers S1-S5 at
various times via data communications links. Each of the
clients Ci is typically associated with a single server Sj, but
these associations can change over time. The clients C1-Cn
both interface with users and produce and retrieve files to
and from servers. The clients C1-Cn are not necessarily
continuously on-line, since they typically serve a single user
and can be movable systems, such as laptop computers,
which can be connected to the data communications network
N at any of a number of locations. Clients could also be a
variety of other computers, such as computers and kiosks
providing access to customized information as well as
targeted advertising to many users, where the users identify
themselves with passwords or with smart cards. A server Si
is a computer system that is presumed to be continuously
on-line and functions to both collect files from various
sources on the data communication network N for access by
local clients C1-Cn and collect files from local clients
C1-Cn for access by remote clients. The server Si is
equipped with persistent storage, such as a magnetic disk
data storage medium, and is interconnected with other
servers via data communications links. The data
communications links can be of arbitrary topology and architecture,
and are described herein for the purpose of simplicity as
point-to-point links or, more precisely, as virtual
point-to-point links. The servers S1-S5 comprise the network
vendors V1-Vk as well as the information servers I1-Im of FIG.
1 and the functions performed by these two classes of
modules can be merged to a greater or lesser extent in a
single server Si or distributed over a number of servers in the
data communication network N. Prior to proceeding with the
description of the preferred embodiment of the invention, a
number of terms are defined. FIG. 3 illustrates in block
diagram form a representation of an arbitrarily selected
network topology for a plurality of servers A-D, each of
which is interconnected to at least one other server and
typically also to a plurality of clients p-s. Servers A-D are
interconnected by a collection of point to point data
communications links, and server A is connected to client r;
server B is connected to clients p-q, while server D is
connected to client s. Servers transmit encrypted or
unencrypted messages amongst themselves: a message typically
contains the textual and/or graphic information stored in a
particular file, and also contains data which describe the type
and origin of this file, the name of the server that is supposed
to receive the message, and the purpose for which the file
contents are being transmitted. Some messages are not
associated with any file, but are sent by one server to other
servers for control reasons, for example to request transmission
of a file or to announce the availability of a new file.
Messages can be forwarded by a server to another server, as
in the case where server A transmits a message to server D
via a relay node of either server C or servers B, C. It is
generally preferable to have multiple paths through the
network, with each path being characterized by its performance
capability and cost to enable the network N to
optimize traffic routing.

Proxy Servers and Pseudonymous Transactions

While the method of using target profile interest summaries
presents many advantages to both target object providers
and users, there are important privacy issues for both
users and providers that must be resolved if the system is to
be used freely and without inhibition by users without fear
of invasion of privacy. It is likely that users desire that some,
if not all, of the user-specific information in their user
profiles and target profile interest summaries remain
confidential, to be disclosed only under certain circumstances
related to certain types of transactions and according
to their personal wishes for differing levels of confidentiality
regarding their purchases and expressed interests.

However, complete privacy and inaccessibility of user
transactions and profile summary information would hinder
implementation of the system for customized electronic
identification of desirable objects and would deprive the user
of many of the advantages derived through the system's use
of user-specific information. In many cases, complete and
total privacy is not desired by all parties to a transaction. For
example, a buyer may desire to be targeted for certain
mailings that describe products that are related to his or her
interests, and a seller may desire to target users who are
predicted to be interested in the goods and services that the
seller provides. Indeed, the usefulness of the technology
described herein is contingent upon the ability of the system
to collect and compare data about many users and many
target objects. A compromise between total user anonymity
and total public disclosure of the user's search profiles or
target profile interest summary is a pseudonym. A pseudonym
is an artifact that allows a service provider to communicate
with users and build and accumulate records of
their preferences over time, while at the same time remaining
ignorant of the users' true identities, so that users can
keep their purchases or preferences private. A second and
equally important requirement of a pseudonym system is
that it provide for digital credentials, which are used to
guarantee that the user represented by a particular pseudonym
has certain properties. These credentials may be
granted on the basis of the results of activities and transactions
conducted by means of the system for customized electronic
identification of desirable objects, or on the basis of other
activities and transactions conducted on the network N of
the present system, or on the basis of users' activities outside
of network N. For example, a service provider may require
proof that the purchaser has sufficient funds on deposit at
his/her bank, which might possibly not be on a network,
before agreeing to transact business with that user. The user,
therefore, must provide the service provider with proof of
funds (a credential) from the bank, while still not disclosing
the user's identity to the service provider.

Our method solves the above problems by combining the
pseudonym granting and credential transfer methods taught
by D. Chaum and J. H. Evertse, in the paper titled "A secure
and privacy-protecting protocol for transmitting personal
information between organizations," with the implementation
of a set of one or more proxy servers distributed
throughout the network N. Each proxy server, for example
Si in FIG. 2, is a server which communicates with clients
and other servers S in the network either directly or through
anonymizing mix paths as detailed in the paper by D. Chaum
titled "Untraceable Electronic Mail, Return Addresses, and
Digital Pseudonyms," published in Communications of the
ACM, Volume 24, Number 2, February 1981. Any server in
the network N may be configured to act as a proxy server in
addition to its other functions. Each proxy server provides
service to a set of users, which set is termed the "user base"
of that proxy server. A given proxy server provides three
sorts of service to each user U in its user base, as follows:

1. The first function of the proxy server is to bidirectionally
transfer communications between user U and other
entities such as information servers (possibly including
the proxy server itself) and/or other users. Specifically,
letting S denote the server that is directly associated
with user U's client processor, the proxy server
communicates with server S (and thence with user U),
either through anonymizing mix paths that obscure the
identity of server S and user U, in which case the proxy
server knows user U only through a secure pseudonym,
or else through a conventional virtual point-to-point
connection, in which case the proxy server knows user
U by user U's address at server S, which address may be
regarded as a non-secure pseudonym for user U.

2. A second function of the proxy server is to record
user-specific information associated with user U. This
user-specific information includes a user profile and
target profile interest summary for user U, as well as a
list of access control instructions specified by user U, as
described below, and a set of one-time return addresses
provided by user U that can be used to send messages
to user U without knowing user U's true identity. All of
this user-specific information is stored in a database
that is keyed by user U's pseudonym (whether secure or
non-secure) on the proxy server.

3. A third function of the proxy server is to act as a 
selective forwarding agent for unsolicited communica- 
tions that are addressed to user U: the proxy server 
forwards some such communications to user U and 
rejects others, in accordance with the access control 
instructions specified by user U. 
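
The second and third functions listed above can be illustrated with a
small Python sketch. It is not the disclosed implementation: the class
and field names are hypothetical, access control instructions are
modelled simply as predicates over a message, and the choice to
consume one one-time return address per forwarded message is an
assumption made for the example. Encryption, mix paths and credentials
are omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Message:
    sender: str
    topic: str
    body: str


# Assumption: an access control instruction is a predicate that returns
# True when the user is willing to receive the message.
Rule = Callable[[Message], bool]


@dataclass
class UserRecord:
    """User-specific information kept by the proxy for one pseudonym."""
    interest_summary: Dict[str, float] = field(default_factory=dict)
    access_rules: List[Rule] = field(default_factory=list)
    return_addresses: List[str] = field(default_factory=list)  # one-time addresses


class ProxyServer:
    def __init__(self) -> None:
        # Database keyed by pseudonym (secure or non-secure).
        self.records: Dict[str, UserRecord] = {}

    def register(self, pseudonym: str, record: UserRecord) -> None:
        self.records[pseudonym] = record

    def forward_unsolicited(self, pseudonym: str, msg: Message) -> Optional[str]:
        """Return the one-time address used to forward msg, or None if rejected."""
        record = self.records.get(pseudonym)
        if record is None:
            return None
        # Selective forwarding: at least one instruction must accept the message.
        if not any(rule(msg) for rule in record.access_rules):
            return None
        if not record.return_addresses:
            return None
        # The address is discarded after use, so it cannot be reused.
        return record.return_addresses.pop(0)
```

A sender addresses the message to the pseudonym only; the proxy
decides whether to forward it and never needs the user's true
identity.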
Our combined method allows a given user to use either a
single pseudonym in all transactions where he or she wishes
to remain pseudonymous, or else different pseudonyms for
different types of transactions. In the latter case, each service
provider might transact with the user under a different
pseudonym for the user. More generally, a coalition of
service providers, all of whom match users with the same
genre of target objects, might agree to transact with the user
using a common pseudonym, so that the target profile
interest summary associated with that pseudonym would be
complete with respect to said genre of target objects. When
a user employs several pseudonyms in order to transact with
different coalitions of service providers, the user may freely
choose a proxy server to service each pseudonym; these
proxy servers may be the same or different. From the service
provider's perspective, our system provides security, in that
it can guarantee that users of a service are legitimately
entitled to the services used and that no user is using
multiple pseudonyms to communicate with the same provider.
This uniqueness of pseudonyms is important for the
purposes of this application, since the transaction information
gathered for a given individual must represent a complete
and consistent picture of a single user's activities with
respect to a given service provider or coalition of service
providers; otherwise, a user's target profile interest summary
and user profile would not be able to represent the user's
interests to other parties as completely and accurately as
possible.

The service provider must have a means of protection
from users who violate previously agreed upon terms of
service. For example, if a user that uses a given pseudonym
engages in activities that violate the terms of service, then
the service provider should be able to take action against the
user, such as denying the user service and blacklisting the
user from transactions with other parties that the user might
be tempted to defraud. This type of situation might occur
when a user employs a service provider for illegal activities
or defaults in payments to the service provider. The method
of the paper titled "Security without identification:
Transaction systems to make Big-Brother obsolete", published in
the Communications of the ACM, 28(10), October 1985,
pp. 1030-1044, incorporated herein, provides for a mechanism
to enforce protection against this type of behavior
through the use of resolution credentials, which are credentials
that are periodically provided to individuals contingent
upon their behaving consistent with the agreed upon terms
of service between the user and information provider and
network vendor entities (such as regular payment for services
rendered, civil conduct, etc.). For the user's safety, if
the issuer of a resolution credential refuses to grant this
resolution credential to the user, then the refusal may be
appealed to an adjudicating third party. The integrity of the
user profiles and target profile interest summaries stored on
proxy servers is important: if a seller relies on such user-

sends a control message to a proxy server under a given
pseudonym, the proxy server uses the pseudonym's public
key to verify that the message has been digitally signed by
someone who knows the pseudonym's private key. This
prevents other parties from masquerading as the user.

Our approach, as disclosed in this application, provides an
improvement over the prior art in privacy-protected pseudonymy
for network subscribers such as taught in U.S. Pat.
No. 5,245,656, which provides for a name translator station
to act as an intermediary between a service provider and the
user. However, while U.S. Pat. No. 5,245,656 provides that
the information transmitted between the end user U and the
service provider be doubly encrypted, the fact that a relationship
exists between user U and the service provider is
known to the name translator, and this fact could be used to
compromise user U, for example if the service provider
specializes in the provision of content that is not deemed
acceptable by user U's peers. The method of U.S. Pat. No.
5,245,656 also omits a method for the convenient updating
of pseudonymous user profile information, such as is provided
in this application, and does not provide for assurance
of unique and credentialed registration of pseudonyms from
a credentialing agent as is also provided in this application,
and does not provide a means of access control to the user
based on profile information and conditional access as will
be subsequently described. The method described by Loeb et
al also does not describe any provision for credentials, such
as might be used for authenticating a user's right to access
particular target objects, such as target objects that are
intended to be available only upon payment of a subscription
fee, or target objects that are intended to be unavailable to
younger users.

Proxy Server Description

In order that a user may ensure that some or all of the
information in the user's user profile and target profile
interest summary remain dissociated from the user's true
identity, the user employs as an intermediary any one of a
number of proxy servers available on the data communication
network N of FIG. 2 (for example, server S2). The
proxy servers function to disguise the true identity of the
user from other parties on the data communication network
N. The proxy server represents a given user to either single
network vendors and information servers or coalitions
thereof. A proxy server, e.g. S2, is a server computer with
CPU, main memory, secondary disk storage and network
communication function and with a database function which
retrieves the target profile interest summary and access
control instructions associated with a particular pseudonym
P, which represents a particular user U, and performs
bi-directional routing of commands, target objects and billing
information between the user at a given client (e.g. C3)
and other network entities such as network vendors V1-Vk
and information servers I1-Im. Each proxy server maintains
an encrypted target profile interest summary associated with
specific information to deliver promotional offers or other 55 each allocated pseudonym in its pseudonym database D. The 
material to a particular class of users, but not to other users, actual user-specific information and the associated pseud- 
then the user-specific information must be accurate and onyms need not be stored locally on the proxy server, but 
untampered with in any way. The user may Ukewise wish to may alternatively be stored in a distributed fashion and be 
ensure that other parties not tamper with the user's user remotely addressable from the proxy server via point-to- 
profile and target profile interest summary, since such modi- 60 point connections. 

fication could degrade the system's ability to match the user The proxy server supports two types of bi-directional 

with the most appropriate target objects. This is done by connections: point-to-point connections and pseudonymous 

providing for the user to apply digital signatures to the connections through mix paths, as taught by P.Chaum in the 

control messages sent by the user to the proxy server Each paper titled "Untraceable Electronic Mail, Return 

pseudonym is paired with a public cryptographic key and a 65 Addresses, and Digital Pseudonyms", Communications of 

private cryptographic key, where the private key is known the ACM, Volume 24, Number 2, February 1981. The 

only to the user who holds that pseudonym; when the user normal connections between the proxy server and informa- 
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tion servers, for example a is connection between proxy 
server S2 and information server S4 in FIG. 2, are accom- 
plished through the point-to-point connection protocols pro- 
vided by network N as described in the "Electronic Media 
System Architecture" section of this application. The normal 
type of point-to-point connections may be used between 
S2-S4, for example, since the dissociation of the user and 
the pseudonym need only occur between the client C3 and
the proxy server S2, where the pseudonym used by the user 
is available. Knowing that an information provider such as 
S4 communicates with a given pseudonym P on proxy server 
S2 does not compromise the true identity of user U. The 
bidirectional connection between the user and the proxy 
server S2 can also be a normal point-to-point connection, but 
it may instead be made anonymous and secure, if the user 
desires, through the consistent use of an anonymizing mix
protocol as taught by D.Chaum in the paper titled "Untrace- 
able Electronic Mail, Return Addresses, and Digital 
Pseudonyms", Communications of the ACM, Volume 24, 
Number 2, February 1981. This mix procedure provides 
untraceable secure anonymous mail between two parties with
blind return addresses through a set of forwarding and return
routing servers termed "mixes". The mix routing protocol,
as taught in the Chaum paper, is used with the proxy server 
S2 to provide a registry of persistent secure pseudonyms that 
can be employed by users other than user U, by information 
providers I1-Im, by vendors V1-Vk and by other proxy
servers to communicate with the users in the proxy server's 
user base on a continuing basis. The security provided by 
this mix path protocol is distributed and resistant to traffic
analysis attacks and other known forms of analysis which 
may be used by malicious parties to try and ascertain the true 
identity of a pseudonym bearer. Breaking the protocol 
requires a large number of parties to maliciously collude or 
be cryptographically compromised. In addition an extension 
to the method is taught where the user can include a return 
path definition in the message so the information server S4 
can return the requested information to the user's client
processor C3. We utilize this feature in a novel fashion to
provide for access and reachability control under user and
proxy server control. 
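
By way of illustration only, the nested enveloping used along a mix path may be pictured with the following Python sketch. It is a toy under stated assumptions: the hash-derived XOR keystream merely stands in for the public-key encryption used in the Chaum paper and is not secure, and the names wrap_for_path and mix_unwrap, the JSON envelope layout, and the per-mix keys are hypothetical and do not appear in this description.

```python
import hashlib
import json


def _keystream(key: bytes, n: int) -> bytes:
    """Derive n bytes from key with SHA-256 in counter mode (toy, NOT secure)."""
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:n]


def _xor(key: bytes, data: bytes) -> bytes:
    """Toy cipher standing in for real encryption; applying it twice decrypts."""
    return bytes(a ^ b for a, b in zip(data, _keystream(key, len(data))))


def wrap_for_path(payload: bytes, path: list, final_hop: str) -> bytes:
    """Nest the payload in one envelope per mix; the payload ends up innermost.

    path is a list of (mix_name, mix_key) pairs in the order they are visited.
    """
    next_hop, body = final_hop, payload
    for name, key in reversed(path):
        envelope = json.dumps({"next": next_hop, "body": body.hex()}).encode()
        body = _xor(key, envelope)
        next_hop = name
    return body


def mix_unwrap(key: bytes, blob: bytes):
    """One mix strips its own layer and learns only the next hop."""
    envelope = json.loads(_xor(key, blob))
    return envelope["next"], bytes.fromhex(envelope["body"])
```

Each mix sees only the address of the next hop and an opaque body, which is the property the description relies on when it states that a large number of parties must collude to trace a message.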

Validation and Allocation of a Unique Pseudonym 

Chaum's pseudonym and credential issuance system, as
described in a publication by D. Chaum and J. H. Evertse,
titled "A secure and privacy-protecting protocol for trans-
mitting personal information between organizations," has
several desirable properties for use as a component in our 
system. The system allows for individuals to use different 
pseudonyms with different organizations (such as banks and 
coalitions of service providers). The organizations which are 
presented with a pseudonym have no more information 
about the individual than the pseudonym itself and a record
of previous transactions carried out under that pseudonym.
Additionally, credentials, which represent facts about a 
pseudonym that an organization is willing to certify, can be 
granted to a particular pseudonym, and transferred to other 
pseudonyms that the same user employs. For example, the
user can use different pseudonyms with different organiza- 
tions (or disjoint sets of organizations), yet still present 
credentials that were granted by one organization, under one 
pseudonym, in order to transact with another organization 
under another pseudonym, without revealing that the two 
pseudonyms correspond to the same user. Credentials may 
be granted to provide assurances regarding the pseudonym 
bearer's age, financial status, legal status, and the like. For 
example, credentials signifying "legal adult" may be issued 
to a pseudonym based on information known about the 
corresponding user by the given issuing organization. Then,
when the credential is transferred to another pseudonym that
represents the user to another disjoint organization, presen-
tation of this credential on the other pseudonym can be taken
as proof of legal adulthood, which might satisfy a condition
of terms of service. Credential-issuing organizations may
also certify particular facts about a user's demographic
profile or target profile interest summary, for example by
granting a credential that asserts "the bearer of this pseud-
onym is either well-read or is middle-aged and works for a
large company"; by presenting this credential to another
entity, the user can prove eligibility for (say) a discount
without revealing the user's personal data to that entity.
Additionally, the method taught by Chaum provides for
assurances that no individual may correspond with a given
organization or coalition of organizations using more than
one pseudonym; that credentials may not be feasibly forged
by the user; and that credentials may not be transferred from
one user's pseudonym to a different user's pseudonym.
Finally, the method provides for expiration of credentials
and for the issuance of "black marks" against individuals
who do not act according to the terms of service that they are
extended. This is done through the resolution credential
mechanism as described in Chaum's work, in which reso-
lutions are issued periodically by organizations to pseud-
onyms that are in good standing. If a user is not issued this
resolution credential by a particular organization or coalition
of organizations, then this user cannot have it available to be
transferred to other pseudonyms which he uses with other
organizations. Therefore, the user cannot convince these
other organizations that he has acted in accordance with terms
of service in other dealings. If this is the case, then the
organization can use this lack of resolution credential to
infer that the user is not in good standing in his other
dealings. In one approach organizations (or other users) may
issue a list of quality related credentials based upon the
experience of transaction (or interaction) with the user
which may act similarly to a letter of recommendation as in
a resume. If such a credential is issued from multiple
organizations, their values become averaged. In an alterna-
tive variation organizations may be issued credentials from
users such as customers which may be used to indicate to
other future users quality of service which can be expected
by subsequent users on the basis of various criteria.

In our implementation, a pseudonym is a data record
consisting of two fields. The first field specifies the address
of the proxy server at which the pseudonym is registered.
The second field contains a unique string of bits (e.g., a
random binary number) that is associated with a particular
user; credentials take the form of public-key digital signa-
tures computed on this number, and the number itself is
issued by a pseudonym administering server Z, as depicted
in FIG. 2, and detailed in a generic form in the paper by D.
Chaum and J. H. Evertse, titled "A secure and privacy-
protecting protocol for transmitting personal information
between organizations." It is possible to send information to
the user holding a given pseudonym, by enveloping the
information in a control message that specifies the pseud-
onym and is addressed to the proxy server that is named in
the first field of the pseudonym; the proxy server may
forward the information to the user upon receipt of the
control message.
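
As a minimal sketch of the two-field record just described, and of a control message addressed by means of its first field, one might write the following Python. The class names, the function names new_pseudonym and envelope_for, and the 16-byte default length are assumptions made for illustration; in the description the bit string is issued by server Z, which the sketch only imitates with a local random draw.

```python
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class Pseudonym:
    proxy_address: str  # first field: proxy server where the pseudonym is registered
    bits: bytes         # second field: unique random bit string tied to one user


@dataclass(frozen=True)
class ControlMessage:
    destination: str      # copied from the pseudonym's first field
    pseudonym: Pseudonym  # tells the proxy which user should receive the payload
    payload: bytes


def new_pseudonym(proxy_address: str, n_bytes: int = 16) -> Pseudonym:
    """Stand-in for issuance by server Z: draw a random bit string locally."""
    return Pseudonym(proxy_address, secrets.token_bytes(n_bytes))


def envelope_for(pseudonym: Pseudonym, payload: bytes) -> ControlMessage:
    """Envelope information for the holder of a pseudonym, addressed to its proxy."""
    return ControlMessage(pseudonym.proxy_address, pseudonym, payload)
```

The sender needs to know nothing about the user beyond the pseudonym itself; the proxy named in the first field is the only party that can forward the payload.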

While the user may use a single pseudonym for all
transactions, in the more general case a user has a set of
several pseudonyms, each of which represents the user in his
or her interactions with a single provider or coalition of
service providers. Each pseudonym in the pseudonym set is
designated for transactions with a different coalition of
related service providers and the pseudonyms used with one
provider or coalition of providers cannot be linked to the
pseudonyms used with other disjoint coalitions of providers.
All of the user's transactions with a given coalition can be
linked by virtue of the fact that they are conducted under the
same pseudonym, and therefore can be combined to define
a unified picture, in the form of a user profile and a target
profile interest summary, of the user's interests vis-a-vis the
service or services provided by said coalition. There are
other circumstances for which the use of a pseudonym may
be useful and the present description is in no way intended
to limit the scope of the claimed invention; for example, the
previously described rapid profiling tree could be used to
pseudonymously acquire information about the user which
is considered by the user to be sensitive such as that
information which is of interest to such entities as insurance
companies, medical specialists, family counselors or dating
services.

Detailed Protocol

In our system, the organizations that the user U interacts
with are the servers S1-Sn on the network N. However,
rather than directly corresponding with each server, the user
employs a proxy server, e.g. S2, as an intermediary between
the local server of the user's own client and the information
provider or network vendor. Mix paths as described by
D. Chaum in the paper titled "Untraceable Electronic Mail,
Return Addresses, and Digital Pseudonyms", Communica-
tions of the ACM, Volume 24, Number 2, February 1981
allow for untraceability and security between the client, such
as C3, and the proxy server, e.g. S2. Let S(M,K) represent
the digital signing of message M by modular exponentiation
with key K as detailed in a paper by Rivest, R. L., Shamir,
A., and Adleman, L. titled "A method for obtaining digital
signatures and public-key cryptosystems", published in the
Comm. ACM 21, 2, February 1978, 120-126. Once a user applies
to server Z for a pseudonym P and is granted a signed
pseudonym signed with the private key SKz of server Z, the
following protocol takes place to establish an entry for the
user U in the proxy server S2's database D.

1. The user now sends proxy server S2 the pseudonym,
which has been signed by Z to indicate the authenticity and
uniqueness of the pseudonym. The user also generates a
PKp, SKp key pair for use with the granted pseudonym,
where SKp is the private key associated with the pseudonym
and PKp is the public key associated with the pseudonym.
The user forms a request to establish pseudonym P on proxy
server S2, by sending the signed pseudonym S(P, SKz) to
the proxy server S2 along with a request to create a new
database entry, indexed by P, and the public key PKp. It
envelops the message and transmits it to proxy server S2
through an anonymizing mix path, along with an anonymous
return envelope header.

2. The proxy server S2 receives the database creation entry
request and associated certified pseudonym message. The
proxy server S2 checks to ensure that the requested pseud-
onym P is signed by server Z and if so grants the request and
creates a database entry for the pseudonym, as well as
storing the user's public key PKp to ensure that only the user
U can make requests in the future using pseudonym P.

3. The structure of the user's database entry consists of a user
profile as detailed herein, a target profile interest summary as
detailed herein, and a Boolean combination of access control
criteria as detailed below, along with the associated public
key for the pseudonym P.

4. At any time after the database entry for pseudonym P is
established, the user U may provide proxy server S2 with
credentials on that pseudonym, provided by third parties,
which credentials make certain assertions about that
pseudonym. The proxy server may verify
those credentials and make appropriate modifications to the
user's profile as required by these credentials such as
recording the user's new demographic status as an adult. It
may also store those credentials, so that it can present them
to service providers on the user's behalf.

The above steps may be repeated, with either the same or 
a different proxy server, each time user U requires a new 
pseudonym for use with a new and disjoint coalition of 
providers. In practice there is an extremely small probability 
that a given pseudonym may have already been allocated,
due to the random nature of the pseudonym generation
process carried out by Z. If this highly unlikely event occurs,
then the proxy server S2 may reply to the user with a signed
message indicating that the generated pseudonym has
already been allocated, and asking for a new pseudonym to
be generated.
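
The registration steps above may be sketched in Python as follows, purely as an illustration. The sketch assumes a toy RSA key for server Z built from the textbook primes 61 and 53, which offers no security; the constants, the ProxyServer class, its register method, and the reply strings are hypothetical names, and the user's public key PKp is stored as an opaque placeholder value.

```python
import hashlib

# Toy RSA key for server Z (61 * 53 = 3233); illustrative only, trivially breakable.
MODULUS, PUBLIC_EXP, PRIVATE_EXP = 3233, 17, 2753


def digest(message: bytes) -> int:
    """Reduce a message to an integer below the modulus."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % MODULUS


def sign(message: bytes, private_exp: int = PRIVATE_EXP) -> int:
    """S(M, K): sign by modular exponentiation with the private key."""
    return pow(digest(message), private_exp, MODULUS)


def verify(message: bytes, signature: int, public_exp: int = PUBLIC_EXP) -> bool:
    """Check a signature by exponentiating with the public key."""
    return pow(signature, public_exp, MODULUS) == digest(message)


class ProxyServer:
    def __init__(self) -> None:
        self.database = {}  # pseudonym bit string -> database entry

    def register(self, pseudonym: bytes, z_signature: int, user_public_key) -> str:
        # Step 2: grant the request only if the pseudonym was signed by server Z.
        if not verify(pseudonym, z_signature):
            return "rejected: pseudonym not signed by Z"
        # Collision case: the randomly generated pseudonym is already in use.
        if pseudonym in self.database:
            return "rejected: pseudonym already allocated"
        # Step 3: structure of the entry; step 4 may later append credentials.
        self.database[pseudonym] = {
            "public_key": user_public_key,
            "user_profile": {},
            "target_profile_interest_summary": {},
            "access_control": [],
            "credentials": [],
        }
        return "created"
```

Storing PKp at registration is what lets the proxy later reject requests that were not signed with the matching private key SKp.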

Pseudonymous Control of an Information Server

Once a proxy server S2 has authenticated and registered 
a user's pseudonym, the user may begin to use the services 
of the proxy server S2, in interacting with other network 
entities such as service providers, as exemplified by server 
S4 in FIG. 2, an information service provider node con- 
nected to the network. The user controls the proxy server S2 
by forming digitally encoded requests that the user subse- 
quently transmits to the proxy server S2 over the network N. 
The nature and format of these requests will vary, since the 
proxy server may be used for any of the services described 
in this application, such as the browsing, querying, and
other navigational functions described below. 

In a generic scenario, the user wishes to communicate 
under pseudonym P with a particular information provider or 
user at address A, where P is a pseudonym allocated to the 
user and A is either a public network address at a server such 
as S4, or another pseudonym that is registered on a proxy 
server such as S4. (In the most common version of this 
scenario, address A is the address of an information provider, 
and the user is requesting that the information provider send
target objects of interest.) The user must form a request R to 
proxy server S2, that requests proxy server S2 to send a 
message to address A and to forward the response back to the 
user. The user may thereby communicate with other parties,
either non-pseudonymous parties, in the case where address 
A is a public network address, or pseudonymous parties, in 
the case where address A is a pseudonym held by, for 
example, a business or another user who prefers to operate 
pseudonymously. 

In other scenarios, the request R to proxy server S2 
formed by the user may have different content. For example, 
request R may instruct proxy server S2 to use the methods 
described later in this description to retrieve from the most 
convenient server a particular piece of information that has 
been multicast to many servers, and to send this information 
to the user. Conversely, request R may instruct proxy server 
S2 to multicast to many servers a file associated with a new 
target object provided by the user, as described below. If the 
user is a subscriber to the news clipping service described 
below, request R may instruct proxy server S2 to forward to 
the user all target objects that the news clipping service has 
sent to proxy server S2 for the user's attention. If the user is 
employing the active navigation service described below, 
request R may instruct proxy server S2 to select a particular 
cluster from the hierarchical cluster tree and provide a menu 
of its subclusters to the user, or to activate a query that 
temporarily affects proxy server S2's record of the user's 
target profile interest summary. If the user is a member of a 
virtual community as described below, request R may 
instruct proxy server S2 to forward to the user all messages
that have been sent to the virtual community.

Regardless of the content of request R, the user, at client
C3, initiates a connection to the user's local server S1, and
instructs server S1 to send the request R along a secure mix
path to the proxy server S2, initiating the following sequence
of actions:

1. The user's client processor C3 forms a signed message
S(R, SKp), which is paired with the user's pseudonym
P and (if the request R requires a response) a secure
one-time set of return envelopes, to form a message M.
It protects the message M with a multiply enveloped
route for the outgoing path. The enveloped routes
provide for secure communication between S1 and the
proxy server S2. The message M is enveloped in the
most deeply nested message and is therefore difficult to
recover should the message be intercepted by an eaves-
dropper.

2. The message M is sent by client C3 to its local server
S1, and is then routed by the data communication
network N from server S1 through a set of mixes as
dictated by the outgoing envelope set and arrives at the
selected proxy server S2.

3. The proxy server S2 separates the received message M
into the request message R, the pseudonym P, and (if
included) the set of envelopes for the return path. The
proxy server S2 uses pseudonym P to index and retrieve
the corresponding record in proxy server S2's database,
which record is stored in local storage at the proxy
server S2 or on other distributed storage media acces-
sible to proxy server S2 via the network N. This record
contains a public key PKp, user-specific information,
and credentials associated with pseudonym P. The
proxy server S2 uses the public key PKp to check that
the signed version S(R, SKp) of request message R is
valid.

4. Provided that the signature on request message R is
valid, the proxy server S2 acts on the request R. For
example, in the generic scenario described above,
request message R includes an embedded message M1
and an address A to whom message M1 should be sent;
in this case, proxy server S2 sends message M1 to the
server named in address A, such as server S4. The
communication is done using signed and optionally
encrypted messages over the normal point to point
connections provided by the data communication net-
work N. When necessary in order to act on embedded
message M1, server S4 may exchange or be caused to
exchange further signed and optionally encrypted mes-
sages with proxy server S2, still over normal point to
point connections, in order to negotiate the release of
user-specific information and credentials from proxy
server S2. In particular, server S4 may require server S2
to supply credentials proving that the user is entitled to
the information requested, for example, proving that
the user is a subscriber in good standing to a particular
information service, that the user is old enough to
legally receive adult material, and that the user has been
offered a particular discount (by means of a special
discount credential issued to the user's pseudonym).

5. If proxy server S2 has sent a message to a server S4 and
server S4 has created a response M2 to message M1 to
be sent to the user, then server S4 transmits the
response M2 to the proxy server S2 using normal
network point-to-point connections.

6. The proxy server S2, upon receipt of the response M2,
creates a return message Mr comprising the response
M2 embedded in the return envelope set that was
earlier transmitted to proxy server S2 by the user in the
original message M. It transmits the return message Mr
along the pseudonymous mix path specified by this
return envelope set, so that the response M2 reaches the
user at the user's client processor C3.

7. The response M2 may contain a request for electronic
payment to the information server S4. The user may
then respond by means of a message M3 transmitted by
the same means as described for message M1 above,
which message M3 encloses some form of anonymous
payment. Alternatively, the proxy server may respond
automatically with such a payment, which is debited
from an account maintained by the proxy server for this
user.

8. Either the response message M2 from the information 
server S4 to the user, or a subsequent message sent by 
the proxy server S2 to the user, may contain advertising 
material that is related to the user's request and/or is 
targeted to the user. Typically, if the user has just 
retrieved a target object X, then (a) either proxy server 
S2 or information server S4 determines a weighted set 
of advertisements that are "associated with" target 
object X, (b) a subset of this set is chosen randomly, 
where the weight of an advertisement is proportional to 
the probability that it is included in the subset, and (c) 
proxy server S2 selects from this subset just those 
advertisements that the user is most likely to be inter- 
ested in. In the variation where proxy server S2 deter- 
mines the set of advertisements associated with target 
object X, then this set typically consists of all adver-
tisements that the proxy server's owner has been paid
to disseminate and whose target profiles are within a 
threshold similarity distance of the target profile of 
target object X. In the variation where information server S4
determines the set of advertisements associated with 
target object X, advertisers typically purchase the right 
to include advertisements in this set. In either case, the 
weight of an advertisement is determined by the 
amount that an advertiser is willing to pay. Following 
step (c), proxy server S2 retrieves the selected adver- 
tising material and transmits it to the user's client 
processor C3, where it will be displayed to the user, 
within a specified length of time after it is received, by 
a trusted process running on the user's client processor 
C3. When proxy server S2 transmits an advertisement, 
it sends a message to the advertiser, indicating that the
advertisement has been transmitted to a user with a 
particular predicted level of interest. The message may 
also indicate the identity of target object X. In return, 
the advertiser may transmit an electronic payment to 
proxy server S2; proxy server S2 retains a service fee 
for itself, optionally forwards a service fee to informa- 
tion server S4, and the balance is forwarded to the user 
or used to credit the user's account on the proxy server. 

9. If the response M2 contains or identifies a target object,
the passive and/or active relevance feedback that the
user provides on this object is tabulated by a process on
the user's client processor C3. A summary of such
relevance feedback information, digitally signed by
client processor C3 with a proprietary private key
SKc3, is periodically transmitted through a secure
mix path to the proxy server S2, whereupon the search
profile generation module 202 resident on server S2
updates the appropriate target profile interest summary
associated with pseudonym P, provided that the signa-
ture on the summary message can be authenticated with
the corresponding public key PKc3 which is available
to all tabulating processes that are ensured to have
integrity.
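
The advertisement selection of step 8 above, parts (b) and (c), may be illustrated with the following Python sketch. It is one possible reading rather than a definitive method: the inclusion probability is assumed to be the advertisement's weight divided by the largest weight and scaled by a hypothetical rate parameter, and the Ad fields, the interest score, and the function name choose_ads are introduced only for illustration.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Ad:
    name: str
    weight: float    # e.g. the amount the advertiser is willing to pay
    interest: float  # predicted interest of this user (hypothetical score)


def choose_ads(ads: List[Ad], k: int, rate: float = 1.0,
               rng: Optional[random.Random] = None) -> List[Ad]:
    """Pick up to k ads from a weighted set associated with a target object."""
    rng = rng or random.Random()
    top = max((ad.weight for ad in ads), default=0.0)
    if top <= 0:
        return []
    # (b) random subset: inclusion probability proportional to the ad's weight
    subset = [ad for ad in ads if rng.random() < rate * ad.weight / top]
    # (c) from that subset, keep the ads the user is most likely to care about
    return sorted(subset, key=lambda ad: ad.interest, reverse=True)[:k]
```

Separating the random draw from the interest ranking keeps the advertiser's payment and the user's predicted interest as independent inputs, as in the step described above.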

When a consumer enters into a financial relationship with
a particular information server based on both parties agree-
ing to terms for the relationship, a particular pseudonym
may be extended for the consumer with respect to the given
provider as detailed in the previous section. When entering
into such a relationship, the consumer and the service
provider agree to certain terms. However, if the user violates
the terms of this relationship, the service provider may
decline to provide service to the pseudonym under which it
transacts with the user. In addition, the service provider has
the recourse of refusing to provide resolution credentials to
the pseudonym, and may choose to do so until the pseud-
onym bearer returns to good standing.

Pre-Fetching of Target Objects

In some circumstances, a user may request access in
sequence to many files, which are stored on one or more
information servers. This behavior is common when navi-
gating a hypertext system such as the World Wide Web, or
when using the target object browsing system described
below.

In general, the user requests access to a particular target
object or menu of target objects; once the corresponding file
has been transmitted to the user's client processor, the user
views its contents and makes another such request, and so
on. Each request may take many seconds to satisfy, due to
retrieval and transmission delays. However, to the extent
that the sequence of requests is predictable, the system for
customized electronic identification of desirable objects can
respond more quickly to each request, by retrieving or
starting to retrieve the appropriate files even before the user
requests them. This early retrieval is termed "pre-fetching of
files."

Pre-fetching of locally stored data has been heavily stud-
ied in memory hierarchies, including CPU-caches and sec-
ondary storage (disks), for several decades. A leader in this
area has been A. J. Smith of Berkeley, who identified a
variety of schemes and analyzed opportunities using exten-
sive traces in both databases and CPU caches. His conclu-
sion was that general schemes only really paid off where
there was some reasonable chance that sequential access was
occurring, e.g., in a sequential read of data. As the balances
between various latencies in the memory hierarchy shifted
during the late 1980's and early 1990's, J. M. Smith and
others identified further opportunities for pre-fetching of
both locally stored data and network data. In particular,
deeper analysis of patterns in work by Blaba showed the
possibility of using expert systems for deep pattern analysis
that could be used for pre-fetching. Work by J. M. Smith
proposed the use of reference history trees to anticipate
references in storage hierarchies where there was some
historical data. Recent work by Touch and the Berkeley
work addressed the case of data on the World-Wide Web,
where the large size of images and the long latencies provide
extra incentive to pre-fetch; Touch's technique is to pre-send
when large bandwidths permit some speculation using
HTML storage references embedded in Web pages, and the
Berkeley work uses techniques similar to J. M. Smith's
reference histories specialized to the semantics of HTML
data.

Successful pre-fetching depends on the ability of the
system to predict the next action or actions of the user. In the
context of the system for customized electronic identifica-
tion of desirable objects, it is possible to cluster users into
groups according to the similarity of their user profiles. Any
of the well-known pre-fetching methods that collect and
utilize aggregate statistics on past user behavior, in order to
predict future user behavior, may then be implemented so
as to collect and utilize a separate set of statistics for each
cluster of users. In this way, the system generalizes its access
pattern statistics from each user to similar users, without
generalizing among users who have substantially different
interests. The system may further collect and utilize a similar
set of statistics that describes the aggregate behavior of all
users; in cases where the system cannot confidently make a
prediction as to what a particular user will do, because the
relevant statistics concerning that user's user cluster are
derived from only a small amount of data, the system may
instead make its predictions based on the aggregate statistics
for all users, which are derived from a larger amount of data.
For the sake of concreteness, we now describe a particular
instantiation of a pre-fetching system that both employs
these insights and makes its pre-fetching decisions
through accurate measurement of the expected cost and
benefit of each potential pre-fetch.
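
The fallback from per-cluster statistics to aggregate statistics may be pictured with the short Python sketch below. It is illustrative only: the class AccessStats, its method names, and the min_samples threshold that decides when a cluster's data is too sparse are assumptions, since the description does not state how confidence is judged.

```python
from collections import Counter, defaultdict


class AccessStats:
    """Counts of 'file G was requested soon after file F', per user cluster."""

    def __init__(self, min_samples: int = 20) -> None:
        self.min_samples = min_samples  # hypothetical confidence threshold
        self.by_cluster = defaultdict(lambda: defaultdict(Counter))
        self.overall = defaultdict(Counter)

    def record(self, cluster: str, trigger: str, follower: str) -> None:
        """Log one observation in both the cluster and the aggregate tables."""
        self.by_cluster[cluster][trigger][follower] += 1
        self.overall[trigger][follower] += 1

    def probability(self, cluster: str, trigger: str, follower: str) -> float:
        """Estimate the chance that follower is requested after trigger."""
        counts = self.by_cluster[cluster][trigger]
        if sum(counts.values()) < self.min_samples:
            counts = self.overall[trigger]  # too little data: use all users
        total = sum(counts.values())
        return counts[follower] / total if total else 0.0
```

With this arrangement a new or small cluster still receives predictions, while a well-observed cluster is never diluted by the behavior of dissimilar users.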

Pre-fetching exhibits a cost-benefit tradeoff. Let t denote 
the approximate number of minutes that pre-fetched files are 
retained in local storage (before they are deleted to make 
room for other pre-fetched files). If the system elects to 
pre-fetch a file corresponding to a target object X, then the 
user benefits from a fast response at no extra cost, provided 
that the user explicitly requests target object X soon there- 
after. However, if the user does not request target object X 
within t minutes of the pre-fetch, then the pre-fetch was 
worthless, and its cost is an added cost that must be borne 
(directly or indirectly) by the user. The first scenario there- 
fore provides benefit at no cost, while the second scenario 
incurs a cost at no benefit. The system tries to favor the first 
scenario by pre-fetching only those files that the user will 
access anyway. Depending on the user's wishes, the system
may pre-fetch either conservatively, where it controls costs
by pre-fetching only files that the user is extremely likely to
request explicitly (and that are relatively cheap to retrieve),
or more aggressively, where it also pre-fetches files that the
user is only moderately likely to request explicitly, thereby 
increasing both the total cost and (to a lesser degree) the total 
benefit to the user. 

In the system described herein, pre-fetching for a user U 
is accomplished by the user's proxy server S. Whenever 
proxy server S retrieves a user-requested file F from an 
information server, it uses the identity of this file F and the 
characteristics of the user, as described below, to identify a 
group of other files G1 . . . Gk that the user is likely to access
soon. The user's request for file F is said to "trigger" files G1
. . . Gk. Proxy server S pre-fetches each of these triggered
files Gi as follows:

1. Unless file Gi is already stored locally (e.g., due to a
previous pre-fetch), proxy server S retrieves file Gi
from an appropriate information server and stores it
locally.

2. Proxy server S timestamps its local copy of file Gi as
having just been pre-fetched, so that file Gi will be 
retained in local storage for a minimum of approxi- 
mately t minutes before being deleted. 
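
The two server-side steps above can be illustrated with a minimal sketch. It is not the patent's implementation: the class name PrefetchCache, the fetch callable, and the injectable clock are hypothetical names introduced here, and eviction is assumed to happen lazily on each request.

```python
import time


class PrefetchCache:
    """Local store of pre-fetched files, each retained for about t minutes."""

    def __init__(self, retention_minutes, fetch, clock=time.time):
        self.retention = retention_minutes * 60.0  # t, in seconds
        self.fetch = fetch    # hypothetical callable: file id -> contents
        self.clock = clock    # injectable clock (an assumption, eases testing)
        self.files = {}       # file id -> contents
        self.stamps = {}      # file id -> time of the most recent pre-fetch

    def prefetch(self, file_id):
        # Step 1: retrieve the file unless a local copy already exists.
        if file_id not in self.files:
            self.files[file_id] = self.fetch(file_id)
        # Step 2: timestamp the copy so it is retained for about t minutes.
        self.stamps[file_id] = self.clock()

    def evict_expired(self):
        # Delete copies whose last pre-fetch is more than t minutes old.
        now = self.clock()
        expired = [f for f, ts in self.stamps.items()
                   if now - ts > self.retention]
        for file_id in expired:
            del self.files[file_id]
            del self.stamps[file_id]

    def request(self, file_id):
        """Serve a user request; the flag tells whether local storage was used."""
        self.evict_expired()
        if file_id in self.files:
            return self.files[file_id], True
        return self.fetch(file_id), False
```

Re-stamping an existing copy in prefetch() mirrors step 2: a file that is triggered again simply has its retention period renewed.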

Whenever user U (or, in principle, any other user registered 
with proxy server S) requests proxy server S to retrieve a file 
that has been pre-fetched and not yet deleted, proxy server 
S can then retrieve the file from local storage rather than
from another server. In a variation on steps 1-2 above, proxy 
server S pre-fetches a file Gi somewhat differently, so that 
pre-fetched files are stored on the user's client processor q 
rather than on server S: 

1. If proxy server S has not pre-fetched file Gi in the past
t minutes, it retrieves file Gi and transmits it to user U's 
client processor q. 

2. Upon receipt of the message sent in step 1, client q 
stores a local copy of file Gi if one is not currently 
stored. 

3. Proxy server S notifies client q that client q should
timestamp its local copy of file Gi; this notification may 
be combined with the message transmitted in step 1, if 
any. 

4. Upon receipt of the message sent in step 3, client q 
timestamps its local copy of file Gi as having just been 
pre-fetched, so that file Gi will be retained in local 
storage for a minimum of approximately t minutes 
before being deleted. During the period that client q 
retains file Gi in local storage, client q can respond to 
any request for file Gi (by user U or, in principle, any 
other user of client q) immediately and without the 
assistance of proxy server S. 

The difficult task is for proxy server S, each time it
retrieves a file F in response to a request, to identify the files
G1 . . . Gk that should be triggered by the request for file F
and pre-fetched immediately. Proxy server S employs a
cost-benefit analysis, performing each pre-fetch whose
benefit exceeds a user-determined multiple of its cost; the user
may set the multiplier low for aggressive pre-fetching or high
for conservative pre-fetching. These pre-fetches may be
performed in parallel. The benefit of pre-fetching file Gi
immediately is defined to be the expected number of seconds 
saved by such a pre-fetch, as compared to a situation where 
Gi is left to be retrieved later (either by a later pre-fetch, or 
by the user's request) if at all. The cost of pre-fetching file 
Gi immediately is defined to be the expected cost for proxy 
server S to retrieve file Gi, as determined for example by the 
network locations of server S and file Gi and by information 
provider charges, times 1 minus the probability that proxy
server S will have to retrieve file Gi within t minutes (to 
satisfy either a later pre-fetch or the user's explicit request) 
if it is not pre-fetched now. 
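
As a small worked illustration of these definitions, the sketch below computes the cost of a pre-fetch and applies the decision rule. The function and parameter names are hypothetical; how the benefit (in seconds) and the probability are estimated is left to the caller.

```python
def prefetch_cost(retrieval_cost, p_needed_within_t):
    """Cost of pre-fetching now: the expected retrieval cost, times
    1 minus the probability that the file would have to be retrieved
    within t minutes anyway."""
    return retrieval_cost * (1.0 - p_needed_within_t)


def should_prefetch(benefit_seconds, retrieval_cost,
                    p_needed_within_t, multiplier):
    """Pre-fetch when the benefit exceeds the user-determined multiple
    of the cost (low multiplier: aggressive; high: conservative)."""
    cost = prefetch_cost(retrieval_cost, p_needed_within_t)
    return benefit_seconds > multiplier * cost
```

For instance, with an expected saving of 2 seconds, a retrieval cost of 10 units and a 0.9 probability that the file will be needed anyway, the cost is about 1 unit, so the pre-fetch is performed with a multiplier of 1 but not with a multiplier of 5.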

The above definitions of cost and benefit have some 
attractive properties. For example, if users tend to retrieve
either file F1 or file F2 (say) after file F, and tend only in the
former case to subsequently retrieve file G1, then the system
will generally not pre-fetch G1 immediately after retrieving
file F: for, to the extent that the user is likely to retrieve file
F2, the cost of the pre-fetch is high, and to the extent that the
user is likely to retrieve file F1 instead, the benefit of the
pre-fetch is low, since the system can save as much or nearly
as much time by waiting until the user chooses F1 and
pre-fetching G1 only then.

The proxy server S may estimate the necessary costs and 
benefits by adhering to the following discipline: 

1. Proxy server S maintains a set of disjoint clusters of the 
users in its user base, clustered according to their user
profiles.

2. Proxy server S maintains an initially empty set PFT of 
"pre-fetch triples" <C,F,G>, where F and G are files, 
and where C identifies either a cluster of users or the set 
of all users in the user base of proxy server S. Each 
pre-fetch triple in the set PFT is associated with several 
stored values specific to that triple. Pre-fetch triples and 
their associated values are maintained according to the
rules in 3 and 4. 

3. Whenever a user U in the user base of proxy server S 
makes a request R2 for a file G, or a request R2 that 
triggers file G, then proxy server S takes the following 
actions: 
a. For C being the user cluster containing user U, and
then again for C being the set of all users:

b. For any request R0 for a file, say file F, made by user
U during the t minutes strictly prior to the request
R2:

c. If the triple <C,F,G> is not currently a member of the
set PFT, it is added to the set PFT with a count of 0,
a trigger-count of 0, a target-count of 0, a total
benefit of 0, and a timestamp whose value is the
current date and time.

d. The count of the triple <C,F,G> is increased by one.

e. If file G was not triggered or explicitly retrieved by
any request that user U made strictly in between
requests R0 and R2, then the target-count of the
triple <C,F,G> is increased by one.

f. If request R2 was a request for file G, then the total
benefit of triple <C,F,G> is increased either by the
time elapsed between request R0 and request R2, or
by the expected time to retrieve file G, whichever is
less.

g. If request R2 was a request for file G, and G was
triggered or explicitly retrieved by one or more
requests that user U made strictly in between
requests R0 and R2, with R1 denoting the earliest
such request, then the total benefit of triple <C,F,G>
is decreased either by the time elapsed between
request R1 and request R2, or by the expected time
to retrieve file G, whichever is less.

4. If a user U requests a file F, then the trigger-count is
incremented by one for each triple currently in the set
PFT such that the triple has form <C,F,G>, where user
U is in the set or cluster identified by C.

5. The "age" of a triple <C,F,G> is defined to be the
number of days elapsed between its timestamp and the
current date and time. If the age of any triple <C,F,G>
exceeds a fixed constant number of days, and also
exceeds a fixed constant multiple of the triple's count,
then the triple may be deleted from the set PFT.
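
The bookkeeping in rules 3 through 5 can be sketched as follows. This is an illustrative reading of the rules, not a definitive implementation: the names PrefetchTable, Stats, Request and ALL_USERS are hypothetical, times are assumed to be in seconds, and rule 4 is assumed to be applied before rule 3 when a request arrives (the text does not fix the order).

```python
from dataclasses import dataclass

ALL_USERS = "ALL"  # hypothetical marker for "the set of all users"


@dataclass
class Stats:
    timestamp: float          # creation time of the triple
    count: int = 0
    trigger_count: int = 0
    target_count: int = 0
    total_benefit: float = 0.0


@dataclass
class Request:
    time: float               # seconds
    file: str                 # file explicitly requested
    triggered: frozenset      # files triggered by this request


class PrefetchTable:
    def __init__(self, t_minutes, cluster_of, expected_time):
        self.window = t_minutes * 60.0
        self.cluster_of = cluster_of        # user -> cluster id
        self.expected_time = expected_time  # file -> seconds to retrieve
        self.pft = {}                       # (C, F, G) -> Stats
        self.history = {}                   # user -> list of Request

    def record(self, user, r2):
        past = self.history.setdefault(user, [])
        # Requests made during the t minutes strictly prior to R2.
        recent = [r for r in past if 0 < r2.time - r.time <= self.window]
        clusters = (self.cluster_of(user), ALL_USERS)
        # Rule 4: a request for F bumps the trigger-count of <C,F,*>.
        for (c, f, _g), s in self.pft.items():
            if f == r2.file and c in clusters:
                s.trigger_count += 1
        # Rule 3: R2 is a request for G, or a request that triggers G.
        for g in {r2.file} | set(r2.triggered):
            for c in clusters:                      # 3a
                for r0 in recent:                   # 3b
                    self._update(c, r0, r2, g, recent)
        past.append(r2)

    def _update(self, c, r0, r2, g, recent):
        key = (c, r0.file, g)
        s = self.pft.setdefault(key, Stats(timestamp=r2.time))  # 3c
        s.count += 1                                            # 3d
        between = [r for r in recent
                   if r0.time < r.time < r2.time
                   and (r.file == g or g in r.triggered)]
        if not between:
            s.target_count += 1                                 # 3e
        if r2.file == g:
            cap = self.expected_time(g)
            s.total_benefit += min(r2.time - r0.time, cap)      # 3f
            if between:
                r1 = min(between, key=lambda r: r.time)
                s.total_benefit -= min(r2.time - r1.time, cap)  # 3g

    def prune(self, now, max_age_days, age_per_count):
        # Rule 5: drop triples that are both old and rarely observed.
        for key, s in list(self.pft.items()):
            age = (now - s.timestamp) / 86400.0
            if age > max_age_days and age > age_per_count * s.count:
                del self.pft[key]
```

Because a triple is created only when G is first seen after F, its trigger-count starts at 0 while its target-count may already be 1; any code that forms the quotient of the two should guard against a zero trigger-count.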

Proxy server S can therefore decide rapidly which files G
should be triggered by a request for a given file F from a
given user U, as follows.

1. Let C0 be the user cluster containing user U, and C1 be
the set of all users.

2. Server S constructs a list L of all triples <C0,F,G> such
that <C0,F,G> appears in set PFT with a count exceeding
a fixed threshold.

3. Server S adds to list L all triples <C1,F,G> such that
<C0,F,G> does not appear on list L and <C1,F,G>
appears in set PFT with a count exceeding another fixed
threshold.

4. For each triple <C,F,G> on list L:

5. Server S computes the cost of triggering file G to be
the expected cost of retrieving file G, times 1 minus the
quotient of the target-count of <C,F,G> by the trigger-count
of <C,F,G>.

6. Server S computes the benefit of triggering file G to be
the total benefit of <C,F,G> divided by the count of
<C,F,G>.

7. Finally, proxy server S uses the computed cost and
benefit, as described earlier, to decide whether file G
should be triggered.

The approach to pre-fetching just described has the
advantage that all data storage and manipulation concerning
pre-fetching decisions by proxy server S is handled locally
at proxy server S. However, this "user-based" approach does
lead to duplicated storage and effort across proxy servers, as
well as incomplete data at each individual proxy server.
That is, the information indicating what files are frequently
retrieved after file F is scattered in an uncoordinated way
across numerous proxy servers. An alternative, "file-based"
approach is to store all such information with file F itself.
The difference is as follows. In the user-based approach, a
pre-fetch triple <C,F,G> in server S's set PFT may mention
any file F and any file G on the network, but is restricted to
clusters C that are subsets of the user base of server S.
By contrast, in the file-based approach, a pre-fetch
triple <C,F,G> in server S's set PFT may mention any
user cluster C and any file G on the network, but is
restricted to files F that are stored on server S. (Note
that in the file-based approach, user clustering is
network-wide, and user clusters may include users from
different proxy servers.) When a proxy server S2 sends
a request to server S to retrieve file F for a user U, server
S2 indicates in this message the user U's user cluster
C0, as well as the user U's value for the user-determined
multiplier that is used in cost-benefit analysis.
Server S can use this information, together with all
its triples in its set PFT of the form <C0,F,G> and
<C1,F,G>, where C1 is the set of all users everywhere
on the network, to determine (exactly as in the
user-based approach) which files G1 . . . Gk are triggered by
the request for file F. When server S sends file F back
to proxy server S2, it also sends this list of files G1 . . . Gk,
so that proxy server S2 can proceed to pre-fetch
files G1 . . . Gk.
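
The trigger decision (steps 1-7 above), which is the same under the user-based and file-based approaches, might be sketched as below. The function name, the dictionary layout of the stored values, and the two threshold defaults are assumptions made for illustration; the quotient is clamped and guarded against a zero trigger-count, which the text does not discuss.

```python
def triggered_files(pft, cluster, f, retrieval_cost, multiplier,
                    cluster_min_count=5, global_min_count=20,
                    all_users="ALL"):
    """Return the files G that a request for file f should trigger.

    pft maps (C, F, G) to a dict with keys count, trigger_count,
    target_count and total_benefit; retrieval_cost maps G to a cost.
    """
    chosen = {}  # G -> stored values, preferring the user's own cluster
    # Step 2: triples for the user's cluster C0 with enough observations.
    for (c, f0, g), s in pft.items():
        if f0 == f and c == cluster and s["count"] > cluster_min_count:
            chosen[g] = s
    # Step 3: fall back to the all-users triples where C0 has no entry.
    for (c, f0, g), s in pft.items():
        if (f0 == f and c == all_users and g not in chosen
                and s["count"] > global_min_count):
            chosen[g] = s
    result = []
    for g, s in chosen.items():  # step 4
        # Step 5: cost = retrieval cost * (1 - target-count/trigger-count).
        if s["trigger_count"]:
            p = min(s["target_count"] / s["trigger_count"], 1.0)
        else:
            p = 0.0
        cost = retrieval_cost(g) * (1.0 - p)
        # Step 6: benefit = total benefit / count.
        benefit = s["total_benefit"] / s["count"]
        # Step 7: trigger when benefit exceeds the multiple of the cost.
        if benefit > multiplier * cost:
            result.append(g)
    return result
```

The fallback in step 3 is what lets sparse per-cluster statistics be replaced by the aggregate statistics for all users, as described earlier.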
The file-based approach requires some additional data
transmission. Recall that under the user-based approach,
server S must execute steps 3c-3g above for any ordered
pair of requests R0 and R2 made within t minutes of each
other by a user who employs server S as a proxy server.
Under the file-based approach, server S must execute steps
3c-3g above for any ordered pair of requests R0 and R2
made within t minutes of each other, by any user on the
network, such that R0 requests a file stored on server S.
Therefore, when a user makes a request R2, the user's proxy
server must send a notification of request R2 to all servers
S such that, during the preceding t minutes (where the
variable t may now depend on server S), the user has made
a request R0 for a file stored on server S. This notification
need not be sent immediately, and it is generally more 
efficient for each proxy server to buffer up such notifications 
and send them periodically in groups to the appropriate 
servers. 

Access And Reachability Control of Users and User-Specific 
Information 

Although users' true identities are protected by the use of
secure mix paths, pseudonymity does not guarantee com- 
plete privacy. In particular, advertisers can in principle 
employ user-specific data to barrage users with unwanted 
solicitations. The general solution to this problem is for 
proxy server S2 to act as a representative on behalf of each 
user in its user base, permitting access to the user and the 
user's private data only in accordance with criteria that have 
been set by the user. Proxy server S2 can restrict access in 
two ways: 

1. The proxy server S2 may restrict access by third parties
to server S2's pseudonymous database of user-specific
information. When a third party such as an advertiser sends a
message to server S2 requesting the release of user-specific
information for a pseudonym P, server S2 refuses to honor
the request unless the message includes credentials for the
accessor adequate to prove that the accessor is entitled to this
information. The user associated with pseudonym P may at
any time send signed control messages to proxy server S2,
specifying the credentials or Boolean combinations of
credentials that proxy server S2 should thenceforth consider to
be adequate grounds for releasing a specified subset of the
information associated with pseudonym P. Proxy server S2
stores these access criteria with its database record for
pseudonym P. For example, a user might wish proxy
server S2 to release purchasing information only to selected
information providers, to charitable organizations (that is,
organizations that can provide a government-issued
credential that is issued only to registered charities), and to market
researchers who have paid user U for the right to study user
U's purchasing habits.

2. The proxy server S2 may restrict the ability of third
parties to send electronic messages to the user. When a third
party such as an advertiser attempts to send information
(such as a textual message or a request to enter into spoken
or written real-time communication) to pseudonym P, by
sending a message to proxy server S2 requesting proxy
server S2 to forward the information to the user at
pseudonym P, proxy server S2 will refuse to honor the request,
unless the message includes credentials for the accessor
adequate to meet the requirements the user has chosen to
impose, as above, on third parties who wish to send
information to the user. If the message does include adequate
credentials, then proxy server S2 removes a single-use
pseudonymous return address envelope from its database
record for pseudonym P, and uses the envelope to send a
message containing the specified information along a secure
mix path to the user of pseudonym P. If the envelope being
used is the only envelope stored for pseudonym P, or more
generally if the supply of such envelopes is low, proxy
server S2 adds a notation to this message before sending it,
which notation indicates to the user's local server that it
should send additional envelopes to proxy server S2 for
future use.
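
A Boolean combination of credentials, as in item 1 above, can be represented compactly as composable predicates. The sketch below is only an illustration: the helper names (has, all_of, any_of, release) and the credential labels are hypothetical, and verifying that a presented credential is genuine is outside its scope.

```python
def has(credential):
    """Criterion satisfied when the accessor presents the named credential."""
    return lambda presented: credential in presented


def all_of(*criteria):
    """Conjunction: every sub-criterion must be satisfied."""
    return lambda presented: all(c(presented) for c in criteria)


def any_of(*criteria):
    """Disjunction: at least one sub-criterion must be satisfied."""
    return lambda presented: any(c(presented) for c in criteria)


# Hypothetical access criteria stored with the record for pseudonym P,
# modeled on the purchasing-information example in the text.
release_purchases = any_of(
    has("selected-information-provider"),
    has("registered-charity"),
    all_of(has("market-researcher"), has("paid-user-for-study")),
)


def release(record, presented, criteria):
    """Return the requested data only if the credentials are adequate."""
    return record if criteria(presented) else None
```

A request carrying only the "market-researcher" credential would be refused, while one that also carries "paid-user-for-study" would be honored.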

In a more general variation, the user may instruct the
proxy server S2 to impose more complex requirements on
the granting of requests by third parties, not simply Boolean
combinations of required credentials. The user may impose
any Boolean combination of simple requirements that may
include, but are not limited to, the following:
(a.) the accessor (third party) is a particular party
(b.) the accessor has provided a particular credential
(c.) satisfying the request would involve disclosure to the
accessor of a certain fact about the user's user profile
(d.) satisfying the request would involve disclosure to the
accessor of the user's target profile interest summary
(e.) satisfying the request would involve disclosure to the
accessor of statistical summary data, which data are
computed from the user's user profile or target profile
interest summary together with the user profiles and
target profile interest summaries of at least n other users
in the user base of the proxy server
(f.) the content of the request is to send the user a target
object, and this target object has a particular attribute
(such as high reading level, or low vulgarity, or an
authenticated Parental Guidance rating from the
MPAA)
(g.) the content of the request is to send the user a target
object, and this target object has been digitally signed
with a particular private key (such as the private key
used by the National Pharmaceutical Association to
certify approved documents)
(h.) the content of the request is to send the user a target
object, and the target profile has been digitally signed
by a profile authentication agency, guaranteeing that
the target profile is a true and accurate profile of the
target object it claims to describe, with all attributes
authenticated
(i.) the content of the request is to send the user a target
object, and the target profile of this target object is
within a specified distance of a particular search profile
specified by the user
(j.) the content of the request is to send the user a target
object, and the proxy server S2, by using the user's
stored target profile interest summary, estimates the
user's likely interest in the target object to be above a
specified threshold
(k.) the accessor indicates its willingness to make a
particular payment to the user in exchange for the
fulfillment of the request

The steps required to create and maintain the user's
access control requirements are as follows:

1. The user composes a Boolean combination of predicates
that apply to requests; the resulting complex predicate
should be true when applied to a request that the user wants
proxy server S2 to honor, and false otherwise. The complex
predicate may be encoded in another form, for efficiency.

2. The complex predicate is signed with SKp, and
transmitted from the user's client processor C3 to the proxy
server S2 through the mix path enclosed in a packet that
also contains the user's pseudonym P.

3. The proxy server S2 receives the packet, verifies its
authenticity using PKp, and stores the access control
instructions specified in the packet as part of its database
record for pseudonym P.

The proxy server S2 enforces access control as follows:

1. The third party (accessor) transmits a request to proxy
server S2 using the normal point-to-point connections
provided by the network N. The request may be to access the
target profile interest summaries associated with a set of
pseudonyms P1 . . . Pn, or to access the user profiles
associated with a set of pseudonyms P1 . . . Pn, or to forward
a message to the users associated with pseudonyms P1 . . .
Pn. The accessor may explicitly specify the pseudonyms P1
. . . Pn, or may ask that P1 . . . Pn be chosen to be the set
of all pseudonyms registered with proxy server S2 that meet
specified conditions.

2. The proxy server S2 indexes the database record for
each pseudonym Pi, retrieves the access requirements
provided by the user associated with Pi, and
determines whether and how the transmitted request should be
satisfied for Pi. If the requirements are satisfied, S2 proceeds
with steps 3a-3c.

3a. If the request can be satisfied but only upon payment
of a fee, the proxy server S2 transmits a payment request to
the accessor, and waits for the accessor to send the payment
to the proxy server S2. Proxy server S2 retains a service fee
and forwards the balance of the payment to the user
associated with pseudonym Pi, via an anonymous return
packet that this user has provided.

3b. If the request can be satisfied but only upon provision
of a credential, the proxy server S2 transmits a credential
request to the accessor, and waits for the accessor to send the
credential to the proxy server S2.

3c. The proxy server S2 satisfies the request by disclosing
user-specific information to the accessor, by providing the
accessor with a set of single-use envelopes to communicate
directly with the user, or by forwarding a message to the
user, as requested.

4. Proxy server S2 optionally sends a message to the
accessor, indicating why each of the denied requests for P1
. . . Pn was denied, and/or indicating how many requests
were satisfied.

5. The active and/or passive relevance feedback provided
by any user U with respect to any target object sent by any
path from the accessor is tabulated by the above-described
tabulating process resident on user U's client processor C3.
As described above, a summary of such information is
periodically transmitted to the proxy server S2 to enable the
proxy server S2 to update that user's target profile interest
summary and user profile.

The access control criteria can be applied to solicited as
well as unsolicited transmissions. That is, the proxy server
can be used to protect the user from inappropriate or
misrepresented target objects that the user may request. If
the user requests a target object from an information server, but
the target object turns out not to meet the access control
criteria, then the proxy server will not permit the information
server to transmit the target object to the user, or to charge
the user for such transmission. For example, to guard
against target objects whose profiles have been tampered
with, the user may specify an access control criterion that
requires the provider to prove the target profile's accuracy
by means of a digital signature from a profile authentication
agency. As another example, the parents of a child user may
instruct the proxy server that only target objects that have
been digitally signed by a recognized child protection
organization may be transmitted to the user; thus, the proxy
server will not let the user retrieve pornography, even from
a rogue information server that is willing to provide
pornography to users who have not supplied an adulthood
credential.

Distribution of Information with Multicast Trees

The graphical representation of the network N presented
in FIG. 3 shows that at least one of the data communications
links can be eliminated, as shown in FIG. 4, while still
enabling the network N to transmit messages among all the
servers A-D. By elimination, we mean that the link is
unused in the logical design of the network, rather than a
physical disconnection of the link. The graphs that result
when all redundant data communications links are
eliminated are termed "trees" or "connected acyclic graphs." A
graph where a message could be transmitted by a server
through other servers and then return to the transmitting
server over a different originating data communications link
is termed a "cycle." A tree is thus an acyclic graph whose
edges (links) connect a set of graph nodes (servers). The
tree can be used to efficiently broadcast any data file to
selected servers in a set of interconnected servers.

The tree structure is attractive in a communications
network because much information distribution is multicast in
nature; that is, a piece of information available at a single
source must be distributed to a multiplicity of points where
the information can be accessed. This technique is widely
known: for example, "FAX trees" are in common use in
political organizations, and multicast trees are widely used
in distribution of multimedia data in the Internet; for
example, see "Scalable Feedback Control for Multicast
Video Distribution in the Internet," (Jean-Chrysostome
Bolot, Thierry Turletti, & Ian Wakeman, Computer
Communication Review, Vol. 24, #4, October '94, Proceedings
of SIGCOMM'94, pp. 58-67) or "An Architecture For
Wide-Area Multicast Routing," (Stephen Deering, Deborah
Estrin, Dino Farinacci, Van Jacobson, Ching-Gung Liu, &
Liming Wei, Computer Communication Review, Vol. 24, #4,
October '94, Proceedings of SIGCOMM'94, pp. 126-135).
While there are many possible trees that can be overlaid on
a graph representation of a network, both the nature of the
networks (e.g., the cost of transmitting data over a link) and
their use (for example, certain nodes may exhibit more
frequent intercommunication) can make one choice of tree
better than another for use as a multicast tree. One of the
most difficult problems in practical network design is the
construction of "good" multicast trees, that is, tree choices
which exhibit low cost (due to data not traversing links
unnecessarily) and good performance (due to data frequently
being close to where it is needed).
Constructing a Multicast Tree 

Algorithms for constructing multicast trees have either
been ad hoc, as is the case of the Deering, et al. Internet
multicast tree, which adds clients as they request service by
grafting them into the existing tree, or have relied on
construction of a minimum-cost spanning tree. A distributed
algorithm for creating a spanning tree (defined as a tree that
connects, or "spans," all nodes of the graph) on a set of
Ethernet bridges was developed by Radia Perlman
("Interconnections: Bridges and Routers," Radia Perlman,
Addison-Wesley, 1992). Creating a minimal-cost spanning tree for a graph
depends on having a cost model for the arcs of the graph
(corresponding to communications links in the
communications network). In the case of Ethernet bridges, the default
cost (more complicated costing models for path costs are
discussed on pp. 72-73 of Perlman) is calculated as a simple
distance measure to the root; thus the spanning tree
minimizes the cost to the root by first electing a unique root and
then constructing a spanning tree based on the distances
from the root. In this algorithm, the root is elected by
recourse to a numeric ID contained in "configuration
messages": the server whose ID has minimum numeric value
is chosen as the root. Several problems exist with this
algorithm in general. First, the method of using an ID does
not necessarily select the best root for the nodes
interconnected in the tree. Second, the cost model is simplistic.
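
The root-election idea described above can be illustrated with a centralized sketch. It is not Perlman's distributed protocol (no configuration messages are exchanged); it merely assumes a known, connected, undirected graph, elects the node with the smallest numeric ID as root, and attaches every other node along a minimum-hop path to that root. The function name and input layout are assumptions.

```python
from collections import deque


def spanning_tree(links):
    """links: dict mapping node ID -> iterable of neighbor IDs
    (undirected, connected). Returns (root, parent), where parent
    maps each node to its parent in the tree (None for the root)."""
    root = min(links)              # the smallest numeric ID wins the election
    parent = {root: None}
    queue = deque([root])
    while queue:                   # breadth-first search: hop distance to root
        node = queue.popleft()
        for nb in sorted(links[node]):   # sorted for a deterministic result
            if nb not in parent:
                parent[nb] = node
                queue.append(nb)
    return root, parent
```

The sketch makes the two criticisms in the text concrete: the root is whichever node happens to have the lowest ID, and every link is treated as having the same cost.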

We first show how to use the similarity-based methods
described above to select the servers most interested in a
group of target objects, herein termed "core servers" for that
group. Next we show how to construct an unrooted multicast
tree that can be used to broadcast files to these core servers.
Finally, we show how files corresponding to target objects
are actually broadcast through the multicast tree at the
initiative of a client, and how these files are later retrieved
from the core servers when clients request them.

Since the choice of core servers to distribute a file to
depends on the set of users who are likely to retrieve the file
(that is, the set of users who are likely to be interested in the
corresponding target object), a separate set of core servers
and hence a separate multicast tree may be used for each
topical group of target objects. Throughout the description
below, servers may communicate among themselves
through any path over which messages can travel; the goal
of each multicast tree is to optimize the multicast
distribution of files corresponding to target objects of the
corresponding topic. Note that this problem is completely distinct
from selecting a multiplicity of spanning trees for the
complete set of interconnected nodes as disclosed by
Sincoskie in U.S. Pat. No. 4,706,080 and the publication titled
"Extended Bridge Algorithms for Large Networks" by W. D.
Sincoskie and C. J. Cotton, published January 1988 in IEEE
Network on pages 16-24. The trees in this disclosure are
intentionally designed to interconnect a selected subset of
nodes in the system, and are successful to the degree that this
subset is relatively small.
Multicast Tree Construction Procedure 

A set of topical multicast trees for a set of homogeneous
target objects may be constructed or reconstructed at any
time, as follows. The set of target objects is grouped into a
fixed number of topical clusters C1 . . . Cp with the methods
described above, for example, by choosing C1 . . . Cp to be
the result of a k-means clustering of the set of target objects,
or alternatively a covering set of low-level clusters from a
hierarchical cluster tree of these target objects. A multicast
tree MT(C) is then constructed for each cluster C in C1 . . .
Cp, by the following procedure:

1. Given a set of proxy servers, S1 . . . Sn, and a topical
cluster C. It is assumed that a general multicast tree MT_full
that contains all the proxy servers S1 . . . Sn has previously
been constructed by well-known methods.

2. Each pair <Si, C> is associated with a weight, w(Si, C), 
which is intended to covary with the expected number of 
users in the user base of proxy server Si who will subse- 
quently access a target object from cluster C. This weight is 
computed by proxy server Si in any of several ways, all of 
which make use of the similarity measurement computation 
described herein. 

One variation makes use of the following steps: (a) Proxy
server Si randomly selects a target object T from cluster C.
(b) For each pseudonym in its local database, with
associated user U, proxy server Si applies the techniques disclosed
above to user U's stored user profile and target profile
interest summary in order to estimate the interest w(U, T)
that user U has in the selected target object T. The aggregate
interest w(Si, T) that the user base of proxy server Si has in
the target object T is defined to be the sum of these interest
values w(U, T). Alternatively, w(Si, T) may be defined to be
the sum of values s(w(U, T)) over all U in the user base.
Here s(*) is a sigmoidal function that is close to 0 for small
arguments and close to a constant p_max for large arguments;
thus s(w(U, T)) estimates the probability that user U will
access target object T, which probability is assumed to be
independent of the probability that any other user will access
target object T. In a variation, w(Si, T) is made to estimate
the probability that at least one user from the user base of Si
will access target object T: then w(Si, T) may be defined as
the maximum of values w(U, T), or as 1 minus the product
over the users U of the quantity (1-s(w(U, T))). (c) Proxy
server Si repeats steps (a)-(b) for several target objects T
selected randomly from cluster C, and averages the several
values of w(Si, T) thereby computed in step (b) to determine
the desired quantity w(Si, C), which quantity represents the
expected aggregate interest by the user base of proxy server
Si in the target objects of cluster C.
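
Steps (a)-(c) of this variation can be sketched as follows. The logistic form of the sigmoid, its parameters (p_max, midpoint, scale), and the function names are assumptions chosen for illustration; the interest estimate w(U, T) is supplied by the caller, and sampling is done with replacement.

```python
import math
import random


def sigmoid(x, p_max=0.5, midpoint=1.0, scale=0.25):
    """Close to 0 for small x and close to p_max for large x."""
    return p_max / (1.0 + math.exp(-(x - midpoint) / scale))


def server_interest(users, target, interest, mode="sum"):
    """Aggregate interest w(Si, T) of a server's user base in target T."""
    values = [interest(u, target) for u in users]
    if mode == "sum":              # plain sum of w(U, T)
        return sum(values)
    if mode == "expected":         # sum of s(w(U, T))
        return sum(sigmoid(v) for v in values)
    if mode == "at_least_one":     # 1 - product of (1 - s(w(U, T)))
        none_access = 1.0
        for v in values:
            none_access *= 1.0 - sigmoid(v)
        return 1.0 - none_access
    raise ValueError(mode)


def cluster_weight(users, cluster, interest, samples=5, mode="sum",
                   rng=random):
    """w(Si, C): average of w(Si, T) over randomly sampled targets T."""
    picks = [rng.choice(cluster) for _ in range(samples)]
    total = sum(server_interest(users, t, interest, mode) for t in picks)
    return total / samples
```

The "at_least_one" mode relies on the independence assumption stated in the text: the probability that no user accesses T is the product of the individual probabilities of not accessing it.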

In another variation, where target profile interest
summaries are embodied as search profile sets, the following
procedure is followed to compute w(Si, C): (a) For each
search profile P_S in the locally stored search profile set of
any user in the user base of proxy server Si, proxy server Si
computes the distance d(P_S, P_C) between the search profile
and the cluster profile P_C of cluster C. (b) w(Si, C) is chosen
to be the maximum value of (-d(P_S, P_C)/r) across all such
search profiles P_S, where r is computed as an affine function
of the cluster diameter of cluster C. The slope and/or
intercept of this affine function are chosen to be smaller
(thereby increasing w(Si, C)) for servers Si for which the
target object provider wishes to improve performance, as
may be the case if the users in the user base of proxy server
Si pay a premium for improved performance, or if
performance at Si will otherwise be unacceptably low due to slow
network connections.

In another variation, the proxy server Si is modified so
that it maintains not only target profile interest summaries
for each user in its user base, but also a single aggregate
target profile interest summary for the entire user base. This
aggregate target profile interest summary is determined in
the usual way from relevance feedback, but the relevance
feedback on a target object, in this case, is considered to be
the frequency with which users in the user base retrieved the
target object when it was new. Whenever a user retrieves a
target object by means of a request to proxy server Si, the
aggregate target profile interest summary for proxy server Si
is updated. In this variation, w(Si, C) is estimated by the
following steps:

(a) Proxy server Si randomly selects a target object T from
cluster C.

(b) Proxy server Si applies the techniques disclosed above
to its stored aggregate target profile interest summary in
order to estimate the aggregate interest w(Si, T) that its
aggregated user base had in the selected target object T,
when new; this may be interpreted as an estimate of the
likelihood that at least one member of the user base will
retrieve a new target object similar to T.

(c) Proxy server Si repeats steps (a)-(b) for several target
objects T selected randomly from cluster C, and averages
the several values of w(Si, T) thereby computed in
step (b) to determine the desired quantity w(Si, C),
which quantity represents the expected aggregate interest
by the user base of proxy server Si in the target
objects of cluster C.

3. Those servers Si from among S1 . . . Sn with the
greatest weights w(Si, C) are designated "core servers" for
cluster C. In one variation, where it is desired to select a
fixed number of core servers, those servers Si with the
greatest values of w(Si, C) are selected. In another variation,
the value of w(Si, C) for each server Si is compared against
a fixed threshold w_min, and those servers Si such that w(Si,
C) equals or exceeds w_min are selected as core servers. If
cluster C represents a narrow and specialized set of target
objects, as often happens when the clusters C1 . . . Cp are
numerous, it is usually adequate to select only a small number
of core servers for cluster C, thereby obtaining substantial
advantages in computational efficiency in steps 4-5 below.

4. A complete graph G(C) is constructed whose vertices
are the designated core servers for cluster C. For each pair
of core servers, the cost of transmitting a message between
those core servers along the cheapest path is estimated, and
the weight of the edge connecting those core servers is taken
to be this cost. The cost is determined as a suitable function
of average transmission charges, average transmission delay,
and worst-case or near-worst-case transmission delay.

5. The multicast tree MT(C) is computed by standard
methods to be the minimum spanning tree (or a near-minimum
spanning tree) for G(C), where the weight of an
edge between two core servers is taken to be the cost of
transmitting a message between those two core servers. Note
that MT(C) does not contain as vertices all proxy servers
S1 . . . Sn, but only the core servers for cluster C.

6. A message M is formed describing the cluster profile
for cluster C, the core servers for cluster C and the topology
of the multicast tree MT(C) constructed on those core
servers. Message M is broadcast to all proxy servers S1 . . .
Sn by means of the general multicast tree MT_full. Each proxy
server Si, upon receipt of message M, extracts the cluster
profile of cluster C, and stores it on a local storage device,
together with certain other information that it determines
from message M, as follows. If proxy server Si is named in
message M as a core server for cluster C, then proxy server
Si extracts and stores the subtree of MT(C) induced by all
core servers whose path distance from Si in the graph MT(C)
is less than or equal to d, where d is a constant positive
integer (usually from 1 to 3). If message M does not name
proxy server Si as a core server for MT(C), then proxy server
Si extracts and stores a list of one or more
nearby core servers that can be inexpensively contacted by
proxy server Si over virtual point-to-point links.

In the network of FIG. 3, to illustrate the use of trees, as
applied to the system of the present invention, consider the
following simple example where it is assumed that client r
provides on-line information for the network, such as an
electronic newspaper. This information can be structured by
client r into a prearranged form, comprising a number of
files, each of which is associated with a different target
object. In the case of an electronic newspaper, the files can
contain textual representations of stock prices, weather
forecasts, editorials, etc. The system determines likely
demand for the target objects associated with these files in
order to optimize the distribution of the files through the
network N of interconnected clients p-s and proxy servers
A-D. Assume that cluster C consists of text articles relating
to the aerospace industry; further assume that the target
profile interest summaries stored at proxy servers A and B
for the users at clients p and r indicate that these users are
strongly interested in such articles. Then the proxy servers A
and B are selected as core servers for the multicast tree
MT(C). The multicast tree MT(C) is then computed to
consist of the core servers, A and B, connected by an edge
that represents the least costly virtual point-to-point link
between A and B (either the direct path A-B or the indirect
path A-C-B, depending on the cost).

Global Requests to Multicast Trees
One type of message that may be transmitted to any proxy
server S is termed a "global request message." Such a
message M triggers the broadcast of an embedded request R
to all core servers in a multicast tree MT(C). The content of
request R and the identity of cluster C are included in the
message M, as is a field indicating that message M is a
global request message. In addition, the message M contains
a field S_last, which is unspecified except under certain
circumstances described below, when it names a specific
core server. A global request message M may be transmitted
to proxy server S by a user registered with proxy server S,
which transmission may take place along a pseudonymous
mix path, or it may be transmitted to proxy server S from
another proxy server, along a virtual point-to-point connection.

When a proxy server S receives a message M that is
marked as a global request message, it acts as follows: 1. If
proxy server S is not a core server for topic C, it retrieves its
locally stored list of nearby core servers for topic C, selects
from this list a nearby core server S', and transmits a copy
of message M over a virtual point-to-point connection to
core server S'. If this transmission fails, proxy server S
repeats the procedure with other core servers on its list. 2. If
proxy server S is a core server for topic C, it executes the
following steps: (a) Act on the request R that is embedded
in message M. (b) Set S_curr to be S. (c) Retrieve the locally
stored subtree of MT(C), and extract from it a list L of all
core servers that are directly linked to S_curr in this subtree.
(d) If the message M specifies a value for S_last, and S_last
appears on the list L, remove S_last from the list L. Note that
list L may be empty before this step, or may become empty
as a result of this step. (e) For each server Si in list L,
transmit a copy of message M from server S to server Si over
a virtual point-to-point connection, where the S_last field of
the copy of message M has been altered to S_curr. If Si cannot
be reached in a reasonable amount of time by any virtual
point-to-point connection (for example, server Si is broken),
recurse to step (c) above with S_last bound to S_curr and S_curr
bound to Si for the duration of the recursion.
When server S' in step 1 or a server Si in step 2(e) receives
a copy of the global request message M, it acts according to
exactly the same steps. As a result, all core servers eventually
receive a copy of global request message M and act on
the embedded request R, unless some core servers cannot be
reached. Even if a core server is unreachable, step (e)
ensures that the broadcast can continue to other core servers
in most circumstances, provided that d>1; higher values of
d provide additional insurance against unreachable core
servers.

Multicasting Files
The system for customized electronic identification of
desirable objects executes the following steps in order to
introduce a new target object into the system. These steps are
initiated by an entity E, which may be either a user entering
commands via a keyboard at a client processor q, as illustrated
in FIG. 3, or an automatic software process resident on
a client or server processor q. 1. Processor q forms a signed
request R, which asks the receiver to store a copy of a file
F on its local storage device. File F, which is maintained by
client q on storage at client q or on storage accessible by
client q over the network, contains the informational content
of or an identifying description of a target object, as
described above. The request R also includes an address at
which entity E may be contacted (possibly a pseudonymous
address at some proxy server D), and asks the receiver to
store the fact that file F is maintained by an entity at said
address. 2. Processor q embeds request R in a message M1,
which it pseudonymously transmits to the entity E's proxy
server D as described above. Message M1 instructs proxy
server D to broadcast request R along an appropriate multicast
tree. 3. Upon receipt of message M1, proxy server D
examines the doubly embedded file F and computes a target
profile P for the corresponding target object. It compares the
target profile P to each of the cluster profiles for topical
clusters C1 . . . Cp described above, and chooses Ck to be
the cluster with the smallest similarity distance to profile P.
4. Proxy server D sends itself a global request message M
instructing itself to broadcast request R along the topical
multicast tree MT(Ck). 5. Proxy server D notifies entity E
through a pseudonymous communication that file F has been
multicast along the topical multicast tree for cluster Ck. As
a result of the procedure that server D and other servers
follow for acting on global request messages, step 4 eventually
causes all core servers for topic Ck to act on request
R and therefore store a local copy of file F. In order to make
room for file F on its local storage device, a core server Si
may have to delete a less useful file. There are several ways
to choose a file to delete. One option, well known in the art,
is for Si to choose to delete the least recently accessed file.
In another variation, Si deletes a file that it believes few
users will access. In this variation, whenever a server Si
stores a copy of a file F, it also computes and stores the
weight w(Si, C_F), where C_F is a cluster consisting of the
single target object associated with file F. Then, when server
Si needs to delete a file, it chooses to delete the file F with
the lowest weight w(Si, C_F). To reflect the fact that files are
accessed less as they age, server Si periodically multiplies its
stored value of w(Si, C_F) by a decay factor, such as 0.95, for
each file F that it then stores. Alternatively, instead of using
a decay factor, server Si may periodically recompute aggregate
interest w(Si, C_F) for each file F that it stores; the
aggregate interest changes over time because target objects
typically have an age attribute that the system considers in
estimating user interest, as described above.

If entity E later wishes to remove file F from the network,
for example because it has just multicast an updated version,
it pseudonymously transmits a digitally signed global
request message to proxy server D, requesting all proxy
servers in the multicast tree MT(Ck) to delete any local copy
of file F that they may be storing.

Queries to Multicast Trees
In addition to global request messages, another type of
message that may be transmitted to any proxy server S is
termed a "query message." When transmitted to a proxy
server, a query message causes a reply to be sent to the
originator of the message; this reply will contain an answer
to a given query Q if any of the servers in a given multicast
tree MT(C) are able to answer it, and will otherwise indicate
that no answer is available. The query and the cluster C are
named in the query message. In addition, the query message
contains a field S_last, which is unspecified except under
certain circumstances described below, when it names a
specific core server. When a proxy server S receives a
message M that is marked as a query message, it acts as
follows: 1. Proxy server S sets A_r to be the return address for
the client or server that transmitted message M to server S.
A_r may be either a network address or a pseudonymous
address. 2. If proxy server S is not a core server for cluster
C, it retrieves its locally stored list of nearby core servers for
topic C, selects from this list a nearby core server S', and
transmits a copy of the locate message M over a virtual
point-to-point connection to core server S'. If this transmission
fails, proxy server S repeats the procedure with other
core servers on its list. Upon receiving a reply, it forwards
this reply to address A_r. 3. If proxy server S is a core server
for cluster C, and it is able to answer query Q using locally
stored information, then it transmits a "positive" reply to A_r
containing the answer. 4. If proxy server S is a core server
for topic C, but it is unable to answer query Q using locally
stored information, then it carries out a parallel depth-first
search by executing the following steps: (a) Set L to be the
empty list. (b) Retrieve the locally stored subtree of MT(C).
For each server Si directly linked to S_curr in this subtree,
other than S_last (if specified), add the ordered pair (Si, S) to
the list L. (c) If L is empty, transmit a "negative" reply to
address A_r saying that server S cannot locate an answer to
query Q, and terminate the execution of step 4; otherwise
proceed to step (d). (d) Select a list L1 of one or more server
pairs (Ai, Bi) from the list L. For each server pair (Ai, Bi)
on the list L1, form a locate message M(Ai, Bi), which is a
copy of message M whose S_last field has been modified to
specify Bi, and transmit this message M(Ai, Bi) to server Ai
over a virtual point-to-point connection. (e) For each reply
received (by S) to a message sent in step (d), act as follows:
(i) If a "positive" reply arrives to a locate message M(Ai,
Bi), then forward this reply to A_r and terminate step 4
immediately. (ii) If a "negative" reply arrives to a locate
message M(Ai, Bi), then remove the pair (Ai, Bi) from the
list L1. (iii) If the message M(Ai, Bi) could not be successfully
delivered to Ai, then remove the pair (Ai, Bi) from the
list L1, and add the pair (Ci, Ai) to the list L1 for each Ci
other than Bi that is directly linked to Ai in the locally stored
subtree of MT(C). (f) Once L1 no longer contains any pair
(Ai, Bi) for which a message M(Ai, Bi) has been sent, or
after a fixed period of time has elapsed, return to step (c).

Retrieving Files from a Multicast Tree
When a processor q in the network wishes to retrieve the
file associated with a given target object, it executes the
following steps. These steps are initiated by an entity E,
which may be either a user entering commands via a
keyboard at a client q, as illustrated in FIG. 3, or an
automatic software process resident on a client or server
processor q. 1. Processor q forms a query Q that asks
whether the recipient (a core server for cluster C) still stores
a file F that was previously multicast to the multicast tree
MT(C); if so, the recipient server should reply with its own
server name. Note that processor q must already know the
name of file F and the identity of cluster C; typically, this
information is provided to entity E by a service such as the
news clipping service or browsing system described below,
which must identify files to the user by (name, multicast
topic) pair. 2. Processor q forms a query message M that
poses query Q to the multicast tree MT(C). 3. Processor q
pseudonymously transmits message M to the user's proxy
server D, as described above. 4. Processor q receives a
response M2 to message M. 5. If the response M2 is
"positive," that is, it names a server S that still stores file F,
then processor q pseudonymously instructs the user's proxy
server D to retrieve file F from server S. If the retrieval fails
because server S has deleted file F since it answered the
query, then client q returns to step 1. 6. If the response M2
is "negative," that is, it indicates that no server in MT(C) still
stores file F, then processor q forms a query Q that asks the
recipient for the address A of the entity that maintains file F;
this entity will ordinarily maintain a copy of file F indefinitely.
All core servers in MT(C) ordinarily retain this
information (unless instructed to delete it by the maintaining
entity), even if they delete file F for space reasons.
Therefore, processor q should receive a response providing
address A, whereupon processor q pseudonymously
instructs the user's proxy server D to retrieve file F from
address A.
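
The retrieval steps above can be sketched as a small single-process Python model. This is only an illustration under stated assumptions: the CoreServer class, the dictionary standing in for the multicast tree, the maintainer_stores mapping, and the max_attempts bound are all hypothetical, and the pseudonymous transmission through proxy server D and the tree search behind each query are deliberately omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class CoreServer:
    """Hypothetical in-memory stand-in for one core server of MT(C)."""
    name: str
    files: Dict[str, bytes] = field(default_factory=dict)      # copies still stored
    maintainers: Dict[str, str] = field(default_factory=dict)  # file name -> address A


def query_holder(tree: Dict[str, CoreServer], file_name: str) -> Optional[str]:
    """Steps 1-4: name of a server that still stores the file, or None ("negative")."""
    for server in tree.values():
        if file_name in server.files:
            return server.name
    return None


def query_maintainer(tree: Dict[str, CoreServer], file_name: str) -> Optional[str]:
    """Step 6: address A of the maintaining entity, if any core server retains it."""
    for server in tree.values():
        address = server.maintainers.get(file_name)
        if address is not None:
            return address
    return None


def retrieve(tree: Dict[str, CoreServer],
             maintainer_stores: Dict[str, Dict[str, bytes]],
             file_name: str,
             max_attempts: int = 3) -> Optional[bytes]:
    """Try the multicast tree first, then fall back to the maintaining entity."""
    for _ in range(max_attempts):  # assumed bound; the text simply returns to step 1
        holder = query_holder(tree, file_name)
        if holder is None:
            break  # "negative" response: go on to step 6
        data = tree[holder].files.get(file_name)
        if data is not None:
            return data  # step 5 succeeded
        # The copy was deleted after the server answered; ask again.
    address = query_maintainer(tree, file_name)
    if address is None:
        return None
    return maintainer_stores.get(address, {}).get(file_name)
```

In a real deployment the deletion between reply and fetch can occur because other servers act concurrently; in this sequential model that branch is kept only to mirror step 5.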

When multiple versions of a file F exist on local servers
throughout the data communication network N, but are not
marked as alternate versions of the same file, the system's
ability to rapidly locate files similar to F (by treating them
as target objects and applying the methods disclosed in
"Searching for Target Objects" above) makes it possible to
find all the alternate versions, even if they are stored
remotely. These related data files may then be reconciled by
any method. In a simple instantiation, all versions of the data
file would be replaced with the version that had the latest
date or version number. In another instantiation, each version
would be automatically annotated with references or
pointers to the other versions.
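
The two reconciliation strategies just described can be sketched as follows. The FileVersion record and both function names are hypothetical, and the sketch assumes the alternate versions have already been located by similarity search.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FileVersion:
    """Hypothetical record for one located copy of a file."""
    location: str                 # server on which this copy is stored
    version: int                  # version number (a date would work the same way)
    content: str
    related: List[str] = field(default_factory=list)  # pointers to other versions


def replace_with_latest(versions: List[FileVersion]) -> List[FileVersion]:
    """First instantiation: every copy is replaced by the latest version."""
    if not versions:
        return []
    latest = max(versions, key=lambda v: v.version)
    return [FileVersion(v.location, latest.version, latest.content)
            for v in versions]


def cross_reference(versions: List[FileVersion]) -> List[FileVersion]:
    """Second instantiation: annotate each copy with pointers to the others."""
    for v in versions:
        v.related = [other.location for other in versions if other is not v]
    return versions
```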

NEWS CLIPPING SERVICE 

The system for customized electronic identification of
desirable objects of the present invention can be used in the
electronic media system of FIG. 1 to implement an automatic
news clipping service which learns to select (filter)
news articles to match a user's interests, based solely on
which articles the user chooses to read. The system for
customized electronic identification of desirable objects
generates a target profile for each article that enters the
electronic media system, based on the relative frequency of
occurrence of the words contained in the article. The system
for customized electronic identification of desirable objects
also generates a search profile set for each user, as a function
of the target profiles of the articles the user has accessed and
the relevance feedback the user has provided on these
articles. As new articles are received for storage on the mass
storage systems SS1-SSm of the information servers I1-Im,
the system for customized electronic identification of desirable
objects generates their target profiles. The generated
target profiles are later compared to the search profiles in the
users' search profile sets, and those new articles whose target
profiles are closest (most similar) to the closest search
profile in a user's search profile set are identified to that user
for possible reading. The computer program providing the
articles to the user monitors how much the user reads (the
number of screens of data and the number of minutes spent
reading), and adjusts the search profiles in the user's search
profile set to more closely match what the user apparently
prefers to read. The details of the method used by this system
are disclosed in flow diagram form in FIG. 5. This method
requires selecting a specific method of calculating user-specific
search profile sets, of measuring similarity between
two profiles, and of updating a user's search profile set (or
more generally target profile interest summary) based on
what the user read, and the examples disclosed herein are
examples of the many possible implementations that can be
used and should not be construed to limit the scope of the
system.

Initialize Users' Search Profile Sets 

The news clipping service instantiates target profile interest
summaries as search profile sets, so that a set of high-interest
search profiles is stored for each user. The search
profiles associated with a given user change over time. As
in any application involving search profiles, they can be
initially determined for a new user (or explicitly altered by
an existing user) by any of a number of procedures, including
the following preferred methods: (1) asking the user to
specify search profiles directly by giving keywords and/or
numeric attributes, (2) using copies of the profiles of target
objects or target clusters that the user indicates are representative
of his or her interest, (3) using a standard set of
search profiles copied or otherwise determined from the
search profile sets of people who are demographically
similar to the user.

Retrieve New Articles from Article Source 

Articles are available on-line from a wide variety of
sources. In the preferred embodiment, one would use the
current day's news as supplied by a news source, such as the
AP or Reuters news wire. These news articles are input to the
electronic media system by being loaded into the mass
storage system SS4 of an information server S4. The article
profile module 201 of the system for customized electronic
identification of desirable objects can reside on the information
server S4 and operates pursuant to the steps illustrated
in the flow diagram of FIG. 5, where, as each article
is received at step 501 by the information server S4, the
article profile module 201 at step 502 generates a target
profile for the article and stores the target profile in an article
indexing memory (typically part of mass storage system SS4)
for later use in selectively delivering articles to users. This
method is equally useful for selecting which articles to read
from electronic news groups and electronic bulletin boards,
and can be used as part of a system for screening and
organizing electronic mail ("e-mail").
Calculate Article Profiles 

A target profile is computed for each new article, as
described earlier. The most important attribute of the target
profile is a textual attribute that stands for the entire text of
the article. This textual attribute is represented as described
earlier, as a vector of numbers, which numbers in the
preferred embodiment include the relative frequencies (TF/IDF
scores) of word occurrences in this article relative to
other comparable articles. The server must count the frequency
of occurrence of each word in the article in order to
compute the TF/IDF scores.
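
As an illustration of the word counting involved, the following Python sketch computes a TF/IDF-style textual attribute for one article against a set of comparable articles. The exact weighting used by the system is defined earlier in the specification and is not reproduced here; the tokenizer and the smoothed inverse document frequency below are assumptions, chosen as one commonly used formulation.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Assumed tokenizer: lowercase runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_profile(article: str, comparable_articles: List[str]) -> Dict[str, float]:
    """Map each word of the article to (term frequency) * (smoothed IDF)."""
    counts = Counter(tokenize(article))
    total_words = sum(counts.values())
    n_docs = len(comparable_articles)
    doc_word_sets = [set(tokenize(doc)) for doc in comparable_articles]

    profile: Dict[str, float] = {}
    for word, count in counts.items():
        # Number of comparable articles in which the word occurs.
        doc_freq = sum(1 for words in doc_word_sets if word in words)
        # Smoothed IDF (an assumption): never zero, defined even if doc_freq is 0.
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1.0
        profile[word] = (count / total_words) * idf
    return profile
```

Words that are frequent in the article but rare among the comparable articles receive the largest scores, which is the property the textual attribute relies on.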

These news articles are then hierarchically clustered in a 
hierarchical cluster tree at step 503, which serves as a 
decision tree for determining which news articles are closest 
to the user's interest. The resulting clusters can be viewed as 
a tree in which the top of the tree includes all target objects 
and branches further down the tree represent divisions of the 
set of target objects into successively smaller subclusters of 
target objects. Each cluster has a cluster profile, so that at 
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each node of the tree, the average target profile (centroid) of the average of all search profiles in subtree S 3, For each 

all target objects stored in the subtree rooted at that node is subcluster (child subtree) T of the root of the target profile 

stored This average of target profiles is computed over the cluster tree (or, let T be the entire target profile cluster tree 

representation of target profiles as vectors of numeric if it contains only one target profile): 4. Compute the cluster 

attributes, as described above. 5 Profile ?t to be the average of aU target profiles in sub^ee 

Compare Current Articles^ Target Profiles to a User's Search T 5. Caloilate d(P,, Py^ th^ distance between P, and 6. If 

p ^[ ^ d(P^, Pj)<t, a threshold, 7. If S contauis only one search 

fr^tes .... 1 *t.- * * profile and T contains only one target profile, declare a 

nie process by which a user empbys this apparatus to ^^^^^^ ^^^^^^ ; g 

retneve news articles of interest is illustrated m flow dm- ^^^^^ ^^^^ \ ^ matches between 

gram form in FIG. 11. At step 1101 the user logs into the 10 ^^^^^^ ^^^^^^^ ^^^^ g ^^^^^ p^^^l^^ ^^^^ j 

data communication network N via their client processor C, threshold used in step 6 is typically an afBne function 

and activates the news reading program. This is accom- ^jj^^j. function of the greater of the cluster variances (or 

pUshed by the user establishing a pseudonymous data com- ^.i^^pj. diameters) of S and X Whenever a match is declared 

munications connection as described above to a proxy server between a search profile and a target profile, the target object 

$2, which provides front-end access to the data communi- 15 that contributed the target profile is identified as being of 

cation network N. The proxy server maintains a list of interest to the user who contributed the search profile. Notice 

authorized pseudonyms and their corresponding public keys that the process can be applied even when the set of users to 

and provides access and billing control; The user has a be considered or the set of target objects to be considered is 

search profile set stored in the local data storage medium on very small. In the case of a single user, the process reduces 

the proxy server S2. When the user requests access to "news" 20 to the method given for identifying articles of interest to a 

at step 1102, the profile matching module 203 resident on single user. In the case of a single target object, the process 

proxy server Si sequentially considers each search profile Pjt constitutes a method for identifying users to whom that 

from the user's search profile set to determine which news target object is of interest, 

articles are most hkely of interest to the user. The news Present list of Articles to User 

articles were automatically clustered into a hierarchical 25 Once the profile correlation step is completed for a 

cluster tree at an earlier step so that the determination can be selected user or group of users, at step 1104 the profile 

made rapidly for each user. The hierarchical cluster tree processing module 203 stores a list of the identified articles 

serves as a decision tree for determining which articles' for presentation to each user. At a user's request, the profile 

target profiles are most similar to search profile p^t: the processing system 203 retrieves the generated list of relevant 

search, for relevant articles begins at the top of the tree, and 30 articles and presents this list of titles of the selected articles 

at each level of the tree the branch or branches are selected to the user, who can then select at step 1105 any article for 

which have cluster profiles closest to p^t- This process is viewing. (If no tides are available, then the first sentence(s) 

recursively executed until the-leaves of the tree are reached, of each article can be used.) The list of article titles is sorted 

identifying individual articles of interest to the user, as according to the degree of similarity of the article's target 

described in the section "Searching for Target Objects" 35 profile to the most similar search profile in the user's search 

^5Qyg profile set. The resulting sorted list is either transmitted in 

A variation on this process exploits the fact that many realtime to the user client processor Cj, if the user is present 

users have similar interests. at their client processor Cj, or can be transmitted to amuser's 

Rather than carry out steps 5-9 of the above process mailbox, resident on the user's cUent processor C^ or stored 

separately for each search profile of each user, it is possible 40 within the server S^ for later retrieval by the user; other 

to achieve added efficiency by carrying out these steps only methods of transmission include facsimile transmission of 

once for each group of similar search profiles, thereby the printed list or telephone transmission by means of a 

satisfying many users' needs at once. In this variation, the text-to-speech system. The user can then transmit a request 

system begins by non-hierarchically clustering aU the search by computer, facsimile, or telephone to indicate which of the 

profiles in the search profile sets of a large number of users. 45 identified articles the user wishes to review, if any. The user 

For each cluster k of search profiles, with cluster profile p;^ can still access all articles in any information server S4 to 

it uses the method described in the section "Searching for
Target Objects" to locate articles with target profiles similar
to p_k. Each located article is then identified as of interest to
each user who has a search profile represented in cluster k
of search profiles.

Notice that the above variation attempts to match clusters
of search profiles with similar clusters of articles. Since this
is a symmetrical problem, it may instead be given a symmetrical
solution, as the following more general variation shows. At
some point before the matching process commences, all the
news articles to be considered are clustered into a hierarchical
tree, termed the "target profile cluster tree," and the search
profiles of all users to be considered are clustered into a second
hierarchical tree, termed the "search profile cluster tree." The
following steps serve to find all matches between individual
target profiles from any target profile cluster tree and individual
search profiles from any search profile cluster tree: 1. For each
child subtree S of the root of the search profile cluster tree (or,
let S be the entire search profile cluster tree if it contains only
one search profile): 2. Compute the cluster profile P_S to be

which the user has authorized access, however, those lower
on the generated list are simply further from the user's
interests, as determined by the user's search profile set. The
server S2 retrieves the article from the local data storage
medium or from an information server S4 and presents the
article one screen at a time to the user's client processor C1.
The user can at any time select another article for reading or
exit the process.
Monitor Which Articles Are Read

The user's search profile set generator 202 at step 1107
monitors which articles the user reads, keeping track of how
many pages of text are viewed by the user, how much time
is spent viewing the article, and whether all pages of the
article were viewed. This information can be combined to
measure the depth of the user's interest in the article,
yielding a passive relevance feedback score, as described
earlier. Although the exact details depend on the length and
nature of the articles being searched, a typical formula might
be:

measure of article attractiveness = 0.2 if the second page is
accessed + 0.2 if all pages are accessed + 0.2 if more than
30 seconds was spent on the article + 0.2 if more than
one minute was spent on the article + 0.2 if the minutes
spent in the article are greater than half the number of
pages.

The computed measure of article attractiveness can then
be used as a weighting function to adjust the user's search
profile set to thereby more accurately reflect the user's
dynamically changing interests.
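
The typical formula above can be sketched in a few lines. This is only an illustration: the function name and its three arguments are hypothetical, and it assumes pages are read in order, so that "the second page is accessed" can be approximated by a page count of at least two.

```python
def article_attractiveness(pages_viewed, total_pages, seconds_spent):
    """Passive relevance score in [0, 1]; each satisfied criterion adds 0.2."""
    minutes_spent = seconds_spent / 60.0
    criteria = [
        pages_viewed >= 2,                   # the second page was accessed
        pages_viewed >= total_pages,         # all pages were accessed
        seconds_spent > 30,                  # more than 30 seconds on the article
        seconds_spent > 60,                  # more than one minute on the article
        minutes_spent > total_pages / 2.0,   # minutes exceed half the page count
    ]
    # Five criteria worth 0.2 each, so the score is the satisfied fraction.
    return sum(criteria) / 5.0


if __name__ == "__main__":
    # A four-page article read in full over two and a half minutes.
    print(article_attractiveness(pages_viewed=4, total_pages=4, seconds_spent=150))
```

A score computed this way could serve as the weighting applied when the search profile set is adjusted.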
Update User Profiles 

Updating of a user's generated search profile set can be
done at step 1108 using the method described in copending
U.S. patent application Ser. No. 08/346,425. When an article
is read, the server S2 shifts each search profile in the set
slightly in the direction of the target profiles of those nearby
articles for which the computed measure of article attractiveness
was high. Given a search profile with attributes u_Ik
from a user's search profile set, and a set of J articles
available with attributes d_jk (assumed correct for now),
where I indexes users, j indexes articles, and k indexes
attributes, user I would be predicted to pick a set of P distinct
articles to minimize the sum of d(u_I, d_j) over the chosen
articles j. The user's desired attributes u_Ik and an article's
attributes d_jk would be some form of word frequencies such
as TF/IDF and potentially other attributes such as the source,
reading level, and length of the article, while d(u_I, d_j) is the
distance between these two attribute vectors (profiles) using
the similarity measure described above. If the user picks a
different set of P articles than was predicted, the user search
profile set generation module should try to adjust u and/or d
to more accurately predict the articles the user selected. In
particular, u_I and/or d_j should be shifted to increase their
similarity if user I was predicted not to select article j but did
select it, and perhaps also to decrease their similarity if user
I was predicted to select article j but did not. A preferred
method is to shift u for each wrong prediction that user I will
not select article j, using the formula:

Here u_I is chosen to be the search profile from user I's
search profile set that is closest to the target profile. If e is
positive, this adjustment increases the match between user
I's search profile set and the target profiles of the articles
user I actually selects, by making u_I closer to d_j for the case
where the algorithm failed to predict an article that the
viewer selected. The size of e determines how many
example articles one must see to change the search profile
substantially. If e is too large, the algorithm becomes
unstable, but for sufficiently small e, it drives u to its correct
value. In general, e should be proportional to the measure of
article attractiveness; for example, it should be relatively
high if user I spends a long time reading article j. One could
in theory also use the above formula to decrease the match
in the case where the algorithm predicted an article that the
user did not read, by making e negative in that case.
However, there is no guarantee that u will move in the
correct direction in that case. One can also shift the attribute
weights w_I of user I by using a similar algorithm:

This is particularly important if one is combining word
frequencies with other attributes. As before, this increases
the match if e is positive, for the case where the algorithm
failed to predict an article that the user read, this time by
decreasing the weights on those characteristics for which the
user's target profile u_I differs from the article's profile d_j.
Again, the size of e determines how many example articles
one must see to replace what was originally believed.
Unlike the procedure for adjusting u, one can also make use of
the fact that the above algorithm decreases the match if e is
negative, for the case where the algorithm predicted an
article that the user did not read. The denominator of the
expression prevents weights from shrinking to zero over
time by renormalizing the modified weights w' so that they
sum to one. Both u and w can be adjusted for each article
accessed. When e is small, as it should be, there is no conflict
between the two parts of the algorithm. The selected user's
search profile set is updated at step 1108.
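
The two adjustments described above can be sketched as follows. The exact expressions are an assumption chosen only to be consistent with the description: the profile u moves a fraction e of the way toward the article profile d, and each weight is reduced by e times the attribute mismatch and then renormalized so that the weights sum to one. The function names and the list-of-floats representation are hypothetical.

```python
def shift_profile(u, d, e):
    """Assumed update u'_k = u_k - e * (u_k - d_k): moves u toward d when e > 0."""
    return [u_k - e * (u_k - d_k) for u_k, d_k in zip(u, d)]


def shift_weights(w, u, d, e):
    """Assumed update: w_k - e * |u_k - d_k|, renormalized to sum to one.

    A small e is assumed, so that the adjusted weights stay positive.
    """
    raw = [w_k - e * abs(u_k - d_k) for w_k, u_k, d_k in zip(w, u, d)]
    total = sum(raw)
    if total <= 0:
        raise ValueError("e is too large for these weights")
    return [r / total for r in raw]


if __name__ == "__main__":
    u = [1.0, 0.0]   # user's search profile (two attributes)
    d = [0.0, 0.0]   # profile of an article the user read unexpectedly
    w = [0.5, 0.5]   # attribute weights
    attractiveness = 0.8            # e.g. from passive relevance feedback
    e = 0.1 * attractiveness        # e proportional to attractiveness
    print(shift_profile(u, d, e))
    print(shift_weights(w, u, d, e))
```

Making e proportional to the measure of article attractiveness, as in the usage lines, follows the text; the constant 0.1 is arbitrary.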
Further Applications of the Filtering Technology 

The news clipping service may deliver news articles (or
advertisements and coupons for purchasables) to off-line
users as well as to users who are on-line. Although the
off-line users may have no way of providing relevance
feedback, the user profile of an off-line user U may be
similar to the profiles of on-line users, for example because
user U is demographically similar to these other users, and
the level of user U's interest in particular target objects can
therefore be estimated via the general interest-estimation
methods described earlier. In one application, the news
clipping service chooses a set of news articles (respectively,
advertisements and coupons) that are predicted to be of
interest to user U, thereby determining the content of a
customized newspaper (respectively, advertising/coupon
circular) that may be printed and physically sent to user U
via other methods. In general, the target objects included in
the printed document delivered to user U are those with the
highest median predicted interest among a group G of users,
where group G consists of either the single off-line user U,
a set of off-line users who are demographically similar to
user U, or a set of off-line users who are in the same
geographic area and thus on the same newspaper delivery
route. In a variation, user group G is clustered into several
subgroups G1 . . . Gk; an average user profile Pi is created
from each subgroup Gi; for each article T and each user
profile Pi, the interest in T by a hypothetical user with user
profile Pi is predicted, and the interest of article T to group
G is taken to be the maximum interest in article T by any of
these k hypothetical users; finally, the customized newspaper
for user group G is constructed from those articles of
greatest interest to group G.
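
The subgroup variation can be illustrated with a short sketch. It assumes the subgroups G1 . . . Gk have already been formed, represents profiles as plain numeric vectors, and uses a hypothetical stand-in for interest prediction (interest falls as Euclidean distance grows); the interest-estimation methods described earlier would take its place in practice.

```python
def mean_profile(profiles):
    """Average user profile Pi of one subgroup Gi."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]


def predicted_interest(user_profile, article_profile):
    """Hypothetical predictor: closer profiles give higher interest."""
    dist = sum((a - b) ** 2 for a, b in zip(user_profile, article_profile)) ** 0.5
    return 1.0 / (1.0 + dist)


def group_interest(subgroups, article_profile):
    """Interest of an article to group G: the maximum over the k average profiles."""
    return max(predicted_interest(mean_profile(g), article_profile) for g in subgroups)


def customized_newspaper(subgroups, articles, size):
    """Pick the `size` articles of greatest interest to the group."""
    ranked = sorted(articles,
                    key=lambda title: group_interest(subgroups, articles[title]),
                    reverse=True)
    return ranked[:size]


if __name__ == "__main__":
    subgroups = [[[1.0, 0.0], [0.8, 0.2]], [[0.0, 1.0]]]
    articles = {"sports": [1.0, 0.0], "arts": [0.0, 1.0], "mixed": [0.5, 0.5]}
    print(customized_newspaper(subgroups, articles, size=2))
```

Taking the maximum rather than the mean lets an article that strongly appeals to one subgroup enter the newspaper even if the other subgroups are indifferent to it.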
The filtering technology of the news clipping service is
not limited to news articles provided by a single source, but
may be extended to articles or target objects collected from
any number of sources. For example, rather than identifying
new news articles of interest, the technology may identify
new or updated World Wide Web pages of interest. In a
second application, termed "broadcast clipping," where
individual users desire to broadcast messages to all interested
users, the pool of news articles is replaced by a pool
of messages to be broadcast, and these messages are sent to
the broadcast-clipping-service subscribers most interested in
them. In a third application, the system scans the transcripts
of all real-time spoken or written discussions on the network
that are currently in progress and designated as public, and
employs the news-clipping technology to rapidly identify
discussions that the user may be interested in joining, or to
rapidly identify and notify users who may be interested in
joining an ongoing discussion. In a fourth application, the
method is used as a post-process that filters and ranks in
order of interest the many target objects found by a conventional
database search, such as a search for all homes
selling for under $200,000 in a given area, for all 1994 news
articles about Marcia Clark, or for all Italian-language films.
In a fifth application, the method is used to filter and rank the
links in a hypertext document by estimating the user's
interest in the document or other object associated with each
link. In a sixth application, paying advertisers, who may be
companies or individuals, are the source of advertisements
or other messages, which take the place of the news articles
in the news clipping service. A consumer who buys a
product is deemed to have provided positive relevance
feedback on advertisements for that product, and a consumer
who buys a product apparently because of a particular
advertisement (for example, by using a coupon clipped from
that advertisement) is deemed to have provided particularly
high relevance feedback on that advertisement. Such feedback
may be communicated to a proxy server by the
consumer's client processor (if the consumer is making the
purchase electronically), by the retail vendor, or by the
credit-card reader (at the vendor's establishment) that the
consumer uses to pay for the purchase. Given a database of
such relevance feedback, the disclosed technology is then
used to match advertisements with those users who are most
interested in them; advertisements selected for a user are
presented to that user by any one of several means, including
electronic mail, automatic display on the user's screen, or
printing them on a printer at a retail establishment where the
consumer is paying for a purchase. The threshold distance
used to identify interest may be increased for a particular
advertisement, causing the system to present that advertisement
to more users, in accordance with the amount that the
advertiser is willing to pay.

A further use of the capabilities of this system is to 
manage a user's investment portfolio. Instead of recom- 
mending articles to the user, the system recommends target 
objects that are investments. As illustrated above by the 
example of stock market investments, many different 
attributes can be used together to profile each investment. 
The user's past investment behavior is characterized in the 
user's search profile set or target profile interest summary, 
and this information is used to match the user with stock 
opportunities (target objects) similar in nature to past investments.
The rapid profiling method described above may be
used to determine a rough set of preferences for new users.
Quality attributes used in this system can include negatively 
weighted attributes, such as a measurement of fluctuations in 
dividends historically paid by the investment, a quality 
attribute that would have a strongly negative weight for a 
conservative investor dependent on a regular flow of invest- 
ment income. Furthermore, the user can set filter parameters 
so that the system can monitor stock prices and automati- 
cally take certain actions, such as placing buy or sell orders, 
or e-mailing or paging the user with a notification, when 
certain stock performance characteristics are met. Thus, the 
system can immediately notify the user when a selected 
stock reaches a predetermined price, without the user having 
to monitor the stock market activity. The user's investments 
can be profiled in part by a "type of investment" attribute (to 
be used in conjunction with other attributes), which distin- 
guishes among bonds, mutual funds, growth stocks, income 
stocks, etc., to thereby segment the user's portfolio accord- 
ing to investment type. Each investment type can then be 
managed to identify investment opportunities and the user 
can identify the desired ratio of investment capital for each 
type, e.g., in accordance with the system's automatic rec- 
ommendation for relative distribution of investment capital 
as indicated by the relative level of user interest for each 
type. 

In one application, the system may also keep track of,
recommend, and notify the user (or page the user, for new
releases and new articles) of important articles which are
most interesting to other users who have a stock portfolio
similar to that of the user. Relevance feedback in this
application determines the relevance of the associative
attributes (each stock) to the relevant textual attribute
contained in the free text of the article's other descriptors
plus any relevant numeric attributes contained in the articles.
Additionally, one could bias the weighting values of users
providing relevance feedback to favor those who have invested
in similar types of stocks and who have a proven track record
of success through their trading decisions. Another application
for which this preadjusted relevance feedback is useful is in
recommending and/or automatically trading the most interesting
stocks to users using the collaborative filtering methods
described above, with the relevance feedback to the system
biased toward those users who had been most successful in
their trading decisions in the past with regard to similar
types of stocks. Accordingly, the similarity techniques
identify the articles and stocks which are most relevant to
one another.

Because there are numerous methods which are used to
attempt to predict for users both stocks and optimal times to
buy or trade, the current user customization techniques are
best implemented as an enhancement feature that provides
the user with not only quality but also customization.

In the preferred implementation for an on-line newspaper
or news filter, each of these capabilities for customized
recommendation and notification of investment-related articles,
stock recommendations, and automated monitoring and trading
features is provided to the user as an integrated
financial news and investment service. Additionally, in
accordance with the virtual communities section described
below, users sharing common portfolios may wish to
correspond on-line to share advice or experiences with other
similar users. Again, users who have a past track record of
success may also be identifiable through these virtual
communities in conjunction with their participation in these
communities, or their comments and advice relating to
specific stocks may be ascribed to those stocks (and made
publicly available).

OTHER ON-LINE NEWSPAPER INTERFACE 
FEATURES 






In accordance with current on-line news interface 
features, several implementation features of the present 
system include the following: 

1. Automatically create a "customized newspaper".
User profiling enabling custom recommendations may be
achieved by purely passive means of user activity data or, if
desired, it can refine and automate the selection process of
articles within user selected categories of interest as well as
recommend articles within different categories which the
user is likely to prefer as evidenced through past behaviors.
Applications include:

(a) Presentation of new articles and corresponding advertisements
which are of highest interest to the user.

(b) Recommending (highlighting) these articles from the
directory.

2. A customized search engine which offers search results
which are tailored and relevancy ranked to user preferences.

3. Using a survey for off-line users for subsequent issues,
a card inserted into each issue identifies or prioritizes
the most interesting articles/ads.
E-Mail Filter 

In addition to the news clipping service described above,
the system for customized electronic identification of desirable
objects functions in an e-mail environment in a similar
but slightly different manner. The news clipping service
selects and retrieves news information that would not oth- 
erwise reach its subscribers. But at the same time, large 
numbers of e-mail messages do reach users, having been 
generated and sent by humans or automatic programs. These 
users need an e-mail filter, which automatically processes 
the messages received. The necessary processing includes a 
determination of the action to be taken with each message, 
including, but not limited to, filing the message, notifying
the user of receipt of a high priority message, or automatically
responding to a message. The e-mail filter system must not
require too great an investment on the part of the user to
learn and use, and the user must have confidence in the
appropriateness of the actions automatically taken by the 
system. The same filter may be applied to voice mail 
messages or facsimile messages that have been converted 
into electronically stored text, whether automatically or at 
the user's request, via the use of well-known techniques for
speech recognition or optical character recognition. 

The filtering problem can be defined as follows: a message
processing function MPF(*) maps from a received
message (document) to one or more of a set of actions. The
actions, which may be quite specific, may be either predefined
or customized by the user. Each action A has an
appropriateness function F_A(*,*) such that F_A(U,D) returns
a real number, representing the appropriateness of selecting
action A on behalf of user U when user U is in receipt of
message D. For example, if D comes from a credible source
and is marked urgent, then discarding the message has a high
cost to the user and has low appropriateness, so that
F_discard(U,D) is small, whereas alerting the user of receipt of the
message is highly appropriate, so that F_alert(U,D) is large.
Given the determined appropriateness function, the function
MPF(D) is used to automatically select the appropriate
action or actions. As an example, the following set of actions
might be useful:

1. Urgently notify user of receipt of message 

2. Insert message into queue for user to read later 

3. Insert message into queue for user to read later, and 
suggest that user reply 

4. Insert message into queue for user to read later, and 
suggest that user forward it to individual R 

5. Summarize message and insert summary into queue 

6. Forward message to user's secretary 

7. File message in directory X 

8. File message in directory Y 

9. Delete message (i.e., ignore message and do not save) 

10. Notify sender that further messages on this subject are 
unwanted 

Notice that actions 9 and 10 in the sample list above are
designed to filter out messages that are undesirable to the
user or that are received from undesirable sources, such as
pesky salespersons, by deleting the unwanted message and/or
sending a reply that indicates that messages of this type
will not be read. The appropriateness functions must be
tailored to describe the appropriateness of carrying out each
action given the target profile for a particular document, and
then a message processing function MPF can be found
which is in some sense optimal with respect to the appropriateness
function. One reasonable choice of MPF always
picks the action with highest appropriateness, and in cases
where multiple actions are highly appropriate and are also
compatible with each other, selects more than one action: for
example, it may automatically reply to a message and also
file the same message in directory X, so that the value of
MPF(D) is the set {reply, file in directory X}. In cases where
the appropriateness of even the most appropriate action falls
below a user-specified threshold, as should happen for
messages of an unfamiliar type, the system asks the user for
confirmation of the action(s) selected by MPF. In addition,
in cases where MPF selects one action over another action
that is nearly as appropriate, the system also asks the user
for confirmation: for example, mail should not be deleted if
it is nearly as appropriate to let the user see it.
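
One way to read the "reasonable choice of MPF" described above is sketched below. It is an interpretation rather than a definitive implementation: the appropriateness scores F_A(U,D) are taken as given, and the `compatible` set, the `margin` used to decide that two actions are "nearly as appropriate," and all names are hypothetical.

```python
def message_processing_function(scores, compatible, threshold, margin):
    """Select action(s) for one message.

    scores:     dict mapping action name -> appropriateness F_A(U, D)
    compatible: set of frozenset pairs of actions that may be taken together
    threshold:  user-specified minimum appropriateness
    margin:     score gap within which two actions count as nearly tied
    Returns (selected_actions, needs_confirmation).
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    best = ranked[0]
    selected = [best]
    # Even the best action is weak: ask the user before acting.
    needs_confirmation = scores[best] < threshold
    for action in ranked[1:]:
        if scores[best] - scores[action] > margin:
            break  # remaining actions are clearly less appropriate
        fits = all(frozenset((s, action)) in compatible for s in selected)
        if fits and scores[action] >= threshold:
            selected.append(action)    # highly appropriate and compatible
        else:
            needs_confirmation = True  # a near-tie with a rival action
    return selected, needs_confirmation


if __name__ == "__main__":
    scores = {"reply": 0.9, "file in directory X": 0.85, "delete": 0.1}
    compatible = {frozenset(("reply", "file in directory X"))}
    print(message_processing_function(scores, compatible, 0.5, 0.1))
```

With these scores the sketch selects both "reply" and "file in directory X" without asking, whereas a near-tie between "delete" and "queue for later" would set the confirmation flag.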

It is possible to write appropriateness functions manually,
but the time necessary and lack of user expertise render this
solution impractical. The automatic training of this system
is preferable, using the automatic user profiling system
described above. Each received document is viewed as a
target object whose profile includes such attributes as the
entire text of the document (represented as TF/IDF scores),
document sender, date sent, document length, date of last
document received from this sender, key words, list of other
addressees, etc. It was disclosed above how to estimate an
interest function on profiled target objects, using relevance
feedback together with measured similarities among target
objects and among users. In the context of the e-mail filter,
the task is to estimate several appropriateness functions
F_A(*,*), one per action. This is handled with exactly the same
method as was used earlier to estimate the topical interest
function f(*,*). Relevance feedback in this case is provided
by the user's observed actions over time: whenever user U
chooses action A on document D, either freely or by choosing
or confirming an action recommended by the system,
this is taken to mean that the appropriateness of action A on
document D is high, particularly if the user takes this action
A immediately after seeing document D. A presumption of
no appropriateness (corresponding to the earlier
presumption of no interest) is used so that action A is
considered inappropriate on a document unless the user or
similar users have taken action A on this document or similar
documents. In particular, if no similar document has been
seen, no action is considered especially appropriate, and the
e-mail filter asks the user to specify the appropriate action or
confirm that the action chosen by the e-mail filter is the
appropriate one.

Thus, the e-mail filter learns to take particular actions on
e-mail messages that have certain attributes or combinations
of attributes. For example, messages from John Doe that
originate in the (212) area code may prompt the system to
forward a copy by fax transmission to a given fax number,
or to file the message in directory X on the user's client
processor. A variation allows active requests of this form
from the user, such as a request that any message from John
Doe be forwarded to a desired fax number until further
notice. This active user input requires the use of a natural
language or form-based interface for which specific commands
are associated with particular attributes and combinations
of attributes.

Update Notification

A very important and novel characteristic of the architecture
is the ability to identify new or updated target objects
that are relevant to the user, as determined by the user's
search profile set or target profile interest summary.
("Updated target objects" include revised versions of documents
and new models of purchasable goods.) The system
may notify the user of these relevant target objects by an
electronic notification such as an e-mail message or facsimile
transmission. In the variation where the system sends
an e-mail message, the user's e-mail filter can then respond
appropriately to the notification, for instance, by bringing
the notification immediately to the user's personal attention,
or by automatically submitting an electronic request to
purchase the target object named in the notification. A
simple example of the latter response is for the e-mail filter
to retrieve an on-line document at a nominal or zero charge,
or request to buy a purchasable of limited quantity such as
a used product or an auctionable.

ACTIVE NAVIGATION (BROWSING)
Browsing by Navigating Through a Cluster Tree 

A hierarchical cluster tree imposes a useful organization
on a collection of target objects. The tree is of direct use to
a user who wishes to browse through all the target objects in
the tree. Such a user may be exploring the collection with or
without a well-specified goal. The tree's division of target
objects into coherent clusters provides an efficient method
whereby the user can locate a target object of interest. The
user first chooses one of the highest level (largest) clusters
from a menu, and is presented with a menu listing the
subclusters of said cluster, whereupon the user may select
one of these subclusters. The system locates the subcluster,
via the appropriate pointer that was stored with the larger
cluster, and allows the user to select one of its subclusters
from another menu. This process is repeated until the user
comes to a leaf of the tree, which yields the details of an
actual target object. Hierarchical trees allow rapid selection
of one target object from a large set. In ten menu selections
from menus of ten items (subclusters) each, one can reach
10^10 = 10,000,000,000 (ten billion) items. In the preferred
embodiment, the user views the menus on a computer screen
or terminal screen and selects from them with a keyboard or
mouse. However, the user may also make selections over the
telephone, with a voice synthesizer reading the menus and
the user selecting subclusters via the telephone's touch-tone
keypad. In another variation, the user simultaneously maintains
two connections to the server, a telephone voice
connection and a fax connection; the server sends successive
menus to the user by fax, while the user selects choices via
the telephone's touch-tone keypad.
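
The menu-by-menu descent can be pictured with a toy sketch. The nested-dictionary representation of the cluster tree and both function names are assumptions made for illustration; the stored pointers mentioned above are modeled simply as dictionary references.

```python
def browse(tree, choices):
    """Follow successive menu selections until a leaf (a target object)."""
    node = tree
    for choice in choices:
        if not isinstance(node, dict):
            break          # already at a leaf; ignore further choices
        node = node[choice]
    return node


def reachable_items(menu_size, selections):
    """Number of leaves reachable with fixed-size menus."""
    return menu_size ** selections


if __name__ == "__main__":
    tree = {
        "news": {"sports": "Article A", "world": "Article B"},
        "films": {"comedy": "Film C"},
    }
    print(browse(tree, ["news", "world"]))
    print(reachable_items(10, 10))
```

The second function reproduces the ten-menus-of-ten-items arithmetic given in the text.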

Just as user profiles commonly include an associative
attribute indicating the user's degree of interest in each
target object, it is useful to augment user profiles with an
additional associative attribute indicating the user's degree
of interest in each cluster in the hierarchical cluster tree. This
degree of interest may be estimated numerically as the
number of subclusters or target objects the user has selected
from menus associated with the given cluster or its
subclusters, expressed as a proportion of the total number of
subclusters or target objects the user has selected. This
associative attribute is particularly valuable if the hierarchical
tree was built using "soft" or "fuzzy" clustering, which
allows a subcluster or target object to appear in multiple
clusters: if a target document appears in both the "sports"
and the "humor" clusters, and the user selects it from a
menu associated with the "humor" cluster, then the system
increases its association between the user and the "humor"
cluster but not its association between the user and the
"sports" cluster.
Labeling Clusters 

Since a user who is navigating the cluster tree is repeatedly
expected to select one of several subclusters from a
menu, these subclusters must be usefully labeled (at step
503), in such a way as to suggest their content to the human
user. It is straightforward to include some basic information
about each subcluster in its label, such as the number of
target objects the subcluster contains (possibly just 1) and
the number of these that have been added or updated
recently. However, it is also necessary to display additional
information that indicates the cluster's content. This
content-descriptive information may be provided by a
human, particularly for large or frequently accessed clusters,
but it may also be generated automatically. The basic
automatic technique is simply to display the cluster's
"characteristic value" for each of a few highly weighted attributes.
With numeric attributes, this may be taken to mean the
cluster's average value for that attribute: thus, if the "year of
release" attribute is highly weighted in predicting which
movies a user will like, then it is useful to display average
year of release as part of each cluster's label. Thus the user
sees that one cluster consists of movies that were released
around 1962, while another consists of movies from around
1982. For short textual attributes, such as "title of movie" or
"title of document," the system can display the attribute's
value for the cluster member (target object) whose profile is
most similar to the cluster's profile (the mean profile for all
members of the cluster), for example, the title of the most
typical movie in the cluster. For longer textual attributes, a
useful technique is to select those terms for which the
amount by which the term's average TF/IDF score across
members of the cluster exceeds the term's average TF/IDF
score across all target objects is greatest, either in absolute
terms or else as a fraction of the standard deviation of the
term's TF/IDF score across all target objects. The selected
terms are replaced with their morphological stems, eliminating
duplicates (so that if both "slept" and "sleeping" were
selected, they would be replaced by the single term "sleep")
and optionally eliminating close synonyms or collocates (so
that if both "nurse" and "medical" were selected, they might
both be replaced by a single term such as "nurse,"
"medical," "medicine," or "hospital"). The resulting set of
terms is displayed as part of the label. Finally, if freely
redistributable thumbnail photographs or other graphical
images are associated with some of the target objects in the
cluster for labeling purposes, then the system can display as
part of the label the image or images whose associated target
objects have target profiles most similar to the cluster
profile.
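
The term-selection technique for longer textual attributes can be sketched as follows. The representation of each target object as a dictionary of TF/IDF scores and the function name are assumptions; stemming and synonym merging are left out of the sketch.

```python
from statistics import mean, pstdev


def label_terms(cluster_docs, all_docs, top_n=3, normalize=False):
    """Return the terms whose cluster-average TF/IDF most exceeds the overall average.

    Each document is a dict mapping term -> TF/IDF score (missing terms score 0).
    With normalize=True the excess is divided by the term's standard deviation
    across all target objects.
    """
    vocabulary = {term for doc in all_docs for term in doc}

    def excess(term):
        overall = [doc.get(term, 0.0) for doc in all_docs]
        in_cluster = [doc.get(term, 0.0) for doc in cluster_docs]
        diff = mean(in_cluster) - mean(overall)
        if normalize:
            spread = pstdev(overall)
            return diff / spread if spread > 0 else 0.0
        return diff

    # Largest excess first; ties broken alphabetically for a stable result.
    return sorted(vocabulary, key=lambda t: (-excess(t), t))[:top_n]


if __name__ == "__main__":
    all_docs = [
        {"sleep": 0.9, "bed": 0.5, "the": 0.1},
        {"sleep": 0.7, "the": 0.1},
        {"goal": 0.8, "the": 0.1},
        {"goal": 0.6, "team": 0.4, "the": 0.1},
    ]
    print(label_terms(all_docs[:2], all_docs, top_n=2))
```

A term such as "the," which scores the same inside and outside the cluster, has zero excess and is never chosen ahead of distinctive terms.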

Users' navigational patterns may provide some useful 
feedback as to the quality of the labels. In particular, if users 
often select a particular cluster to explore, but then quickly 
backtrack and try a different cluster, this may signal that the 
first cluster's label is misleading. Insofar as other terms and 
attributes can provide "next-best" alternative labels for the
first cluster, such "next-best" labels can be automatically 
substituted for the misleading label. In addition, any user can 
locally relabel a cluster for his or her own convenience. 
Although a cluster label provided by a user is in general 
visible only to that user, it is possible to make global use of 
these labels via a "user labels" textual attribute for target 
objects, which attribute is defined for a given target object to 
be the concatenation of all labels provided by any user for
any cluster containing that target object. This attribute 
influences similarity judgments: for example, it may induce 
the system to regard target articles in a cluster often labeled 
"Sports News" by users as being mildly similar to articles in 
an otherwise dissimilar cluster often labeled "International 
News" by users, precisely because the "user labels" attribute 
in each cluster profile is strongly associated with the term 
"News." The "user label" attribute is also used in the 
automatic generation of labels, just as other textual attributes 
are, so that if the user-generated labels for a cluster often 
include "Sports," the term "Sports" may be included in the 
automatically generated label as well. 

It is not necessary for menus to be displayed as simple 
fists of labeled options; it is possible to display or print a 
menu in a form that shows in more detail the relation of the 



01/14/2003, EAST Version: 1.03.0002 



us 6,460,036 Bl 

67 68 

different menu options to each other. Thus, in a variation, the
menu options are visually laid out in two dimensions or in
a perspective drawing of three dimensions. Each option is
displayed or printed as a textual or graphical label. The
physical coordinates at which the options are displayed or
printed are generated by the following sequence of steps: (1)
construct for each option the cluster profile of the cluster it
represents, (2) construct from each cluster profile its
decomposition into a numeric vector, as described above, (3) apply
singular value decomposition (SVD) to determine the set of
two or three orthogonal linear axes along which these
numeric vectors are most greatly differentiated, and (4) take
the coordinates of each option to be the projected
coordinates of that option's numeric vector along said axes. Step
(3) may be varied to determine a set of, say, 6 axes, so that
step (4) lays out the options in a 6-dimensional space; in this
case the user may view the geometric projection of the
6-dimensional layout onto any plane passing through the
origin, and may rotate this viewing plane in order to see
differing configurations of the options, which emphasize
similarity with respect to differing attributes in the profiles
of the associated clusters. In the visual representation, the
sizes of the cluster labels can be varied according to the
number of objects contained in the corresponding clusters.
In a further variation, all options from the parent menu are
displayed in some number of dimensions, as just described,
but with the option corresponding to the current menu
replaced by a more prominent subdisplay of the options on
the current menu; optionally, the scale of this composite
display may be gradually increased over time, thereby
increasing the area of the screen devoted to showing the
options on the current menu, and giving the visual
impression that the user is regarding the parent cluster and
"zooming in" on the current cluster and its subclusters.
Further Navigational

It should be appreciated that a hierarchical cluster-tree
may be configured with multiple cluster selections
branching from each node or the same labeled clusters presented in
the form of single branches for multiple nodes ordered in a
hierarchy. In one variation, the user is able to perform lateral
navigation between neighboring clusters as well, by
requesting that the system search for a cluster whose cluster profile
resembles the cluster profile of the currently selected cluster.
If this type of navigation is performed at the level of
individual objects (leaf ends), then automatic hyperlinks
may be then created as navigation occurs. This is one way
that nearest-neighbor clustering navigation may be
performed. For example, in a domain where target objects are
home pages on the World Wide Web, a collection of such
pages could be laterally linked to create a "virtual mall."

The simplest way to use the automatic menuing system
described above is for the user to begin browsing at the top
of the tree and moving to more specific subclusters.
However, in a variation, the user optionally provides a query
consisting of textual and/or other attributes, from which
query the system constructs a profile in the manner
described herein, optionally altering textual attributes as
described herein before decomposing them into numeric
attributes. Query profiles are similar to the search profiles in
a user's search profile set, except that their attributes are
explicitly specified by a user, most often for one-time usage,
and unlike search profiles, they are not automatically
updated to reflect changing interests. A typical query in the
domain of text articles might have "Tell me about the
relation between Galileo and the Medici family" as the value
of its "text of article" attribute, and 8 as the value of its
"reading difficulty" attribute (that is, 8th-grade level). The
system uses the method of section "Searching for Target
Objects" above to automatically locate a small set of one or
more clusters with profiles similar to the query profile, for
example, the articles they contain are written at roughly an
8th-grade level and tend to mention Galileo and the Medicis.
The user may start browsing at any of these clusters, and can
move from it to subclusters, superclusters, and other nearby
clusters. For a user who is looking for something in
particular, it is generally less efficient to start at the largest
cluster and repeatedly select smaller subclusters than it is to
write a brief description of what one is looking for and then
to move to nearby clusters if the objects initially
recommended are not precisely those desired.

Although it is customary in information retrieval systems
to match a query to a document, an interesting variation is
possible where a query is matched to an already answered
question. The relevant domain is a customer service center,
electronic newsgroup, or Better Business Bureau where
questions are frequently answered. Each new question-answer
pair is recorded for future reference as a target
object, with a textual attribute that specifies the question
together with the answer provided. As explained earlier with
reference to document titles, the question should be
weighted more heavily than the answer when this textual
attribute is decomposed into TF/IDF scores. A query
specifying "Tell me about the relation between Galileo and the
Medici family" as the value of this attribute therefore locates
a cluster of similar questions together with their answers. In
a variation, each question-answer pair may be profiled with
two separate textual attributes, one for the question and one
for the answer. A query might then locate a cluster by
specifying only the question attribute, or for completeness,
both the question attribute and the (lower-weighted) answer
attribute, to be the text "Tell me about the relation between
Galileo and the Medici family."

The filtering technology described earlier can also aid the
user in navigating among the target objects. When the
system presents the user with a menu of subclusters of a
cluster C of target objects, it can simultaneously present an
additional menu of the most interesting target objects in
cluster C, so that the user has the choice of accessing a
subcluster or directly accessing one of the target objects. If
this additional menu lists n target objects, then for each i
between 1 and n inclusive, in increasing order, the i-th most
prominent choice on this additional menu, which choice is
denoted Top(C,i), is found by considering all target objects
in cluster C that are further than a threshold distance t from
all of Top(C,1), Top(C,2), . . . Top(C,i-1), and selecting the
one in which the user's interest is estimated to be highest. If
the threshold distance t is 0, then the menu resulting from
this procedure simply displays the n most interesting objects
in cluster C, but the threshold distance may be increased to
achieve more variety in the target objects displayed.
Generally the threshold distance t is chosen to be an affine
function or other function of the cluster variance or cluster
diameter of the cluster C.

As a novelty feature, the user U can "masquerade" as
another user V, such as a prominent intellectual or a celebrity
supermodel; as long as user U is masquerading as user V, the
filtering technology will recommend articles not according
to user U's preferences, but rather according to user V's
preferences. Provided that user U has access to the
user-specific data of user V, for example because user V has
leased these data to user U for a financial consideration, then
user U can masquerade as user V by instructing user U's
proxy server S to temporarily substitute user V's user profile
and target profile interest summary for user U's. In a
variation, user U has access to an average user profile and a
composite target profile interest summary for a group G of
users; by instructing proxy server S to substitute these for
user U's user-specific data, user U can masquerade as a
typical member of group G, as is useful in exploring group
preferences for sociological, political, or market research.
More generally, user U may "partially masquerade" as
another user V or group G, by instructing proxy server S to
temporarily replace user U's user-specific data with a
weighted average of user U's user-specific data and the
user-specific data for user V and group G.
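
The "partial masquerade" above amounts to a weighted average of user-specific data. The following is a minimal sketch, assuming (for illustration only) that each user's data can be flattened into a dictionary of numeric attribute weights; the function name `blend_profiles` and the example attributes are hypothetical.

```python
def blend_profiles(own, others, own_weight):
    """Weighted average of numeric profiles.

    own        -- user U's profile as {attribute: value}
    others     -- list of (profile, weight) pairs, e.g. user V and group G
    own_weight -- weight kept on user U's own data
    The weights are required to sum to 1 (an assumption of this sketch).
    """
    total = own_weight + sum(w for _, w in others)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    keys = set(own)
    for profile, _ in others:
        keys |= set(profile)
    return {
        k: own_weight * own.get(k, 0.0)
           + sum(w * p.get(k, 0.0) for p, w in others)
        for k in sorted(keys)
    }

user_u = {"sports": 1.0, "opera": 0.0}
user_v = {"sports": 0.0, "opera": 1.0, "chess": 0.5}

# User U keeps 75% of their own preferences and borrows 25% from user V.
blended = blend_profiles(user_u, [(user_v, 0.25)], own_weight=0.75)
print(blended)
```

Setting `own_weight` to 0 gives a full masquerade; passing a group average as one of the `others` covers masquerading as a typical member of group G.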
Menu Organization 

Although the topology of a hierarchical cluster tree is 
fixed by the techniques that build the tree, the hierarchical 
menu presented to the user for the user's navigation need not 
be exactly isomorphic to the cluster tree. The menu is 
typically a somewhat modified version of the cluster tree, 
reorganized manually or automatically so that the clusters 
most interesting to a user are easily accessible by the user. 
In order to automatically reorganize the menu in a user- 
specific way, the system first attempts automatically to 
identify existing clusters that are of interest to the user. The 
system may identify a cluster as interesting because the user 
often accesses target objects in that cluster — or, in a more 
sophisticated variation, because the user is predicted to have 
high interest in the cluster's profile, using the methods 
disclosed herein for estimating interest from relevance feed- 
back. 

Several techniques can then be used to make interesting 
clusters more easily accessible. The system can at the user's 
request or at all times display a special list of the most 
interesting clusters, or the most interesting subclusters of the 
current cluster, so that the user can select one of these 
clusters based on its label and jump directly to it. In general, 
when the system constructs a list of interesting clusters in 
this way, the i-th most prominent choice on the list, which 
choice is denoted Top(i), is found by considering all 
appropriate clusters C that are further than a threshold distance t 
from all of Top(1), Top(2), . . . Top(i-1), and selecting the 
one in which the user's interest is estimated to be highest. 
Here the threshold distance t is optionally dependent on the 
computed cluster variance or cluster diameter of the profiles 
in the latter cluster. Several techniques that reorganize the 
hierarchical menu tree are also useful. First, menus can be 
reorganized so that the most interesting subcluster choices 
appear earliest on the menu, or are visually marked as 
interesting, for example, their labels are displayed in a 
special color or type face, or are displayed together with a 
number or graphical image indicating the likely level of 
interest. Second, interesting clusters can be moved to menus 
higher in the tree, i.e., closer to the root of the tree, so that 
they are easier to access if the user starts browsing at the root 
of the tree. Third, uninteresting clusters can be moved to 
menus lower in the tree, to make room for interesting 
clusters that are being moved higher. Fourth, clusters with an 
especially low interest score (representing active dislike) can 
simply be suppressed from the menus; thus, a user with 
children may assign an extremely negative weight to the 
"vulgarity" attribute in the determination of q, so that vulgar 
clusters and documents will not be available at all. As the 
interesting clusters and the documents in them migrate 
toward the top of the tree, a customized tree develops that 
can be more efficiently navigated by the particular user. If 
menus are chosen so that each menu item is chosen with 
approximately equal probability, then the expected number 
of choices the user has to make is minimized. If, for 
example, a user frequently accessed target objects whose 
profiles resembled the cluster profile of cluster (a, b, d) in 
FIG. 8, then the menu in FIG. 9 could be modified to show 
the structure illustrated in FIG. 10. 
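
The rule above for building a list of interesting clusters (each choice must lie further than a threshold distance t from every earlier choice, and among the eligible candidates the one with the highest estimated interest is taken) can be sketched as a single greedy pass. The function name `select_top`, the one-dimensional "profiles", and the interest scores are assumptions for illustration; in the system described, distance would be measured between cluster profiles and interest would come from the relevance-feedback estimate.

```python
def select_top(items, interest, distance, n, t=0.0):
    """Return up to n items in decreasing order of interest, skipping any
    item that is not further than t from every item already chosen."""
    chosen = []
    # Scanning in decreasing interest is equivalent to repeatedly taking
    # the most interesting eligible item: a rejected item stays rejected,
    # because the set of chosen items only grows.
    for item in sorted(items, key=interest, reverse=True):
        if len(chosen) == n:
            break
        if all(distance(item, prior) > t for prior in chosen):
            chosen.append(item)
    return chosen

# Toy data: positions stand in for cluster profiles.
positions = {"A": 0.0, "B": 0.1, "C": 1.0, "D": 1.05}
scores = {"A": 0.9, "B": 0.8, "C": 0.7, "D": 0.6}

def interest(c):
    return scores[c]

def distance(x, y):
    return abs(positions[x] - positions[y])

print(select_top(positions, interest, distance, n=3, t=0.0))  # most interesting
print(select_top(positions, interest, distance, n=3, t=0.5))  # more variety
```

With t = 0 the list is simply the n most interesting items; raising t trades some interest for variety, and the list may come out shorter than n when few items are far enough apart.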
In the variation where the general techniques disclosed 
herein for estimating a user's interest from relevance 
feedback are used to identify interesting clusters, it is possible 
for a user U to supply "temporary relevance feedback" to 
indicate a temporary interest that is added to his or her usual 
interests. This is done by entering a query as described 
above, i.e., a set of textual and other attributes that closely 
match the user's interests of the moment. This query 
becomes "active," and affects the system's determination of 
interest in either of two ways. In one approach, an active 
query is treated as if it were any other target object, and by 
virtue of being a query, it is taken to have received relevance 
feedback that indicates especially high interest. In an 
alternative approach, target objects X whose target profiles are 
similar to an active query's profile are simply considered to 
have higher quality q(U, X), in that q(U, X) is incremented 
by a term that increases with target object X's similarity to 
the query profile. Either strategy affects the usual interest 
estimates: clusters that match user U's usual interests (and 
have high quality q(*)) are still considered to be of interest, 
and clusters whose profiles are similar to an active query are 
adjudged to have especially high interest. Clusters that are 
similar to both the query and the user's usual interests are 
most interesting of all. The user may modify or deactivate an 
active query at any time while browsing. In addition, if the 
user discovers a target object or cluster X of particular 
interest while browsing, he or she may replace or augment 
the original (perhaps vague) query profile with the target 
profile of target object or cluster X, thereby amplifying or 
refining the original query to indicate a particular interest 
in objects similar to X. For example, suppose the user is 
browsing through documents, and specifies an initial query 
containing the word "Lloyd's," so that the system predicts 
documents containing the word "Lloyd's" to be more 
interesting and makes them more easily accessible, even to the 
point of listing such documents or clusters of such 
documents, as described above. In particular, certain articles 
about insurance containing the phrase "Lloyd's of London" 
are made more easily accessible, as are certain pieces of 
Welsh fiction containing phrases like "Lloyd's father." The 
user browses while this query is active, and hits upon a 
useful article describing the relation of Lloyd's of London to 
other British insurance houses; by replacing or augmenting 
the query with the full text of this article, the user can turn 
the attention of the system to other documents that resemble 
this article, such as documents about British insurance 
houses, rather than Welsh folk tales. 
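
The second approach above, in which q(U, X) is incremented by a term that grows with X's similarity to the active query, can be sketched as follows. Cosine similarity over term-weight dictionaries and the linear `boost` factor are assumptions chosen for the example; the specification does not fix a particular similarity measure or increment here.

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse term-weight profiles."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def boosted_quality(base_q, target_profile, query_profile=None, boost=1.0):
    """Quality with an added term that increases with similarity to the
    active query; unchanged when no query is active."""
    if query_profile is None:
        return base_q
    return base_q + boost * cosine(target_profile, query_profile)

# Toy profiles echoing the "Lloyd's" example (weights are made up).
insurance = {"lloyd's": 1.0, "insurance": 1.0}
fiction = {"lloyd's": 1.0, "father": 1.0, "wales": 1.0}

vague_query = {"lloyd's": 1.0}
refined_query = dict(insurance)  # query replaced by a useful article's profile

for query in (vague_query, refined_query):
    print(round(boosted_quality(0.2, insurance, query), 3),
          round(boosted_quality(0.2, fiction, query), 3))
```

Under the vague query both documents receive a boost; once the query is replaced by the insurance article's profile, the gap between the insurance article and the Welsh fiction widens, which is the refinement effect described in the example.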

In a system where queries are used, it is useful to include 
in the target profiles an associative attribute that records the 
associations between a target object and whatever terms are 
employed in queries used to find that target object. The 
association score of target object X with a particular query 
term T is defined to be the mean relevance feedback on 
target object X, averaged over just those accesses of target 
object X that were made while a query containing term T 
was active, multiplied by the negated logarithm of term T's 
global frequency in all queries. The effect of this associative 
attribute is to increase the measured similarity of two 
documents if they are good responses to queries that contain 
the same terms. A further maneuver can be used to improve 
the accuracy of responses to a query: in the summation used 
to determine the quality q(U, X) of a target object X, a term 
is included that is proportional to the sum of association 
scores between target object X and each term in the active
query, if any, so that target objects that are closely associated
with terms in an active query are determined to have higher
quality and therefore higher interest for the user. To
complement the system's automatic reorganization of the
hierarchical cluster tree, the user can be given the ability to
reorganize the tree manually, as he or she sees fit. Any
changes are optionally saved on the user's local storage
device so that they will affect the presentation of the tree in
future sessions. For example, the user can choose to move or
copy menu options to other menus, so that useful clusters
can thereafter be chosen directly from the root menu of the
tree or from other easily accessed or topically appropriate
menus. In another example, the user can select clusters C1,
C2, . . . Ck listed on a particular menu M and choose to
remove these clusters from the menu, replacing them on the
menu with a single aggregate cluster M containing all the
target objects from clusters C1, C2, . . . Ck. In this case, the
immediate subclusters of new cluster M are either taken to
be clusters C1, C2, . . . Ck themselves, or else, in a variation
similar to the "scatter-gather" method, are automatically
computed by clustering the set of all the subclusters of
clusters C1, C2, . . . Ck according to the similarity of the
cluster profiles of these subclusters.
Electronic Mall

In one application, the browsing techniques described
above may be applied to a domain where the target objects
are purchasable goods. When shoppers look for goods to
purchase over the Internet or other electronic media, it is
typically necessary to display thousands or tens of thousands
of products in a fashion that helps consumers find the items
they are looking for. The current practice is to use
hand-crafted menus and sub-menus in which similar items are
grouped together. It is possible to use the automated
clustering and browsing methods described above to more
effectively group and present the items. Purchasable items
can be hierarchically clustered using a plurality of different
criteria. Useful attributes for a purchasable item include but
are not limited to a textual description and predefined
category labels (if available), the unit price of the item, and
an associative attribute listing the users who have bought
this item in the past. Also useful is an associative attribute
indicating which other items are often bought on the same
shopping "trip" as this item; items that are often bought on
the same trip will be judged similar with respect to this
attribute, so tend to be grouped together. Retailers may be
interested in utilizing a similar technique for purposes of
predicting both the nature and relative quantity of items
which are likely to be popular to their particular clientele.
This prediction may be made by using aggregate purchasing
records as the search profile set from which a collection of
target objects is recommended. Estimated customer demand
which is indicative of (relative) inventory quantity for each
target object item is determined by measuring the cluster
variance of that item compared to another target object item
(which is in stock).

As described above, hierarchically clustering the
purchasable target objects results in a hierarchical menu
system, in which the target objects or clusters of target
objects that appear on each menu can be labeled by names
or icons and displayed in a two-dimensional or
three-dimensional menu in which similar items are displayed
physically near each other or on the same graphically
represented "shelf." As described above, this grouping
occurs both at the level of specific items (such as standard
size Ivory soap or large Breck shampoo) and at the level of
classes of items (such as soaps and shampoos). When the
user selects a class of items (for instance, by clicking on it),
then the more specific level of detail is displayed. It is
neither necessary nor desirable to limit each item to
appearing in one group; customers are more likely to find an object
if it is in multiple categories. Non-purchasable objects such
as artwork, advertisements, and free samples may also be
added to a display of purchasable objects, if they are
associated with (liked by) substantially the same users as are
the purchasable objects in the display.
Network Context of the Browsing System

The files associated with target objects are typically
distributed across a large number of different servers S1-So
and clients C1-Cn. Each file has been entered into the data
storage medium at some server or client in any one of a
number of ways, including, but not limited to: scanning,
keyboard input, e-mail, FTP transmission, automatic
synthesis from another file under the control of another computer
program. While a system to enable users to efficiently locate
target objects may store its hierarchical cluster tree on a
single centralized machine, greater efficiency can be
achieved if the storage of the hierarchical cluster tree is
distributed across many machines in the network. Each
cluster C, including single-member clusters (target objects),
is digitally represented by a file F, which is multicast to a
topical multicast tree MT(C1); here cluster C1 is either
cluster C itself or some supercluster of cluster C. In this way,
file F is stored at multiple servers, for redundancy. The file
F that represents cluster C contains at least the following
data:
1. The cluster profile for cluster C, or data sufficient to
reconstruct this cluster profile.
2. The number of target objects contained in cluster C.
3. A human-readable label for cluster C, as described in
section "Labeling Clusters" above.
4. If the cluster is divided into subclusters, a list of pointers
to files representing the subclusters. Each pointer is an
ordered pair naming, first, a file, and second, a
multicast tree or a specific server where that file is stored.
5. If the cluster consists of a single target object, a pointer to
the file corresponding to that target object.

The process by which a client machine can retrieve the file
F from the multicast tree MT(C1) is described above in
section "Retrieving Files from a Multicast Tree." Once it has
retrieved file F the client can perform further tasks
pertaining to this cluster, such as displaying a labeled menu of
subclusters, from which the user may select subclusters for
the client to retrieve next.

The advantage of this distributed implementation is
threefold. First, the system can be scaled to larger cluster sizes
and numbers of target objects, since much more searching
and data retrieval can be carried out concurrently. Second,
the system is fault-tolerant in that partial matching can be
achieved even if portions of the system are temporarily
unavailable. It is important to note here the robustness due
to redundancy inherent in our design: data is replicated at tree
sites so that even if a server is down, the data can be located
elsewhere.

The distributed hierarchical cluster tree can be created in
a distributed fashion, that is, with the participation of many
processors. Indeed, in most applications it should be
recreated from time to time, because as users interact with target
objects, the associative attributes in the target profiles of the
target objects change to reflect these interactions; the
system's similarity measurements can therefore take these
interactions into account when judging similarity, which
allows a more perspicuous cluster tree to be built. The key
technique is the following procedure for merging n disjoint
cluster trees, represented respectively by files F1 . . . Fn in
distributed fashion as described above, into a combined
cluster tree that contains all the target objects from all these
trees. The files F1 . . . Fn are described above, except that the
cluster labels are not included in the representation. The
following steps are executed by a server S1, in response to
a request message from another server S0, which request
message includes pointers to the files F1 . . . Fn.
1. Retrieve files F1 . . . Fn.
2. Let L and M be empty lists.
3. For each file Fi from among F1 . . . Fn:
   4. If file Fi contains pointers to subcluster files, add these
   pointers to list L.
   5. If file Fi represents a single target object, add a pointer
   to file Fi to list L.
6. For each pointer X on list L, retrieve the file that pointer X
points to and extract the cluster profile P(X) that this file
stores.
7. Apply a clustering algorithm to group the pointers X on
list L according to the distances between their respective
cluster profiles P(X).
8. For each (nonempty) resulting group C of pointers:
   9. If C contains only one pointer, add this pointer to list M;
   10. otherwise, if C contains exactly the same subcluster
   pointers as does one of the files Fi from among F1 . . . Fn,
   then add a pointer to file Fi to list M;
   11. otherwise:
      12. Select an arbitrary server S2 on the network, for
      example by randomly selecting one of the pointers in
      group C and choosing the server it points to.
      13. Send a request message to server S2 that includes the
      subcluster pointers in group C and requests server S2 to
      merge the corresponding subcluster trees.
      14. Receive a response from server S2, containing a
      pointer to a file G that represents the merged tree. Add
      this pointer to list M.
15. For each file Fi from among F1 . . . Fn:
   16. If list M does not include a pointer to file Fi, send a
   message to the server or servers storing Fi instructing
   them to delete file Fi.
17. Create and store a file F that represents a new cluster,
whose subcluster pointers are exactly the subcluster pointers
on list M.
18. Send a reply message to server S0, which reply message
contains a pointer to file F and indicates that file F
represents the merged cluster tree.

With the help of the above procedure, and the multicast
tree MT_full that includes all proxy servers in the network,
the distributed hierarchical cluster tree for a particular
domain of target objects is constructed by merging many
local hierarchical cluster trees, as follows.
1. One server S (preferably one with good connectivity) is
elected from the tree.
2. Server S sends itself a global request message that causes
each proxy server in MT_full (that is, each proxy server in
the network) to ask its clients for files for the cluster tree.
3. The clients of each proxy server transmit to the proxy
server any files that they maintain, which files represent
target objects from the appropriate domain that should be
added to the cluster tree.
4. Server S forms a request R1 that, upon receipt, will cause
the recipient server S1 to take the following actions:
   (a) Build a hierarchical cluster tree of all the files stored
   on server S1 that are maintained by users in the user base
   of S1. These files correspond to target objects from the
   appropriate domain. This cluster tree is typically stored
   entirely on S1, but may in principle be stored in a
   distributed fashion.
   (b) Wait until all servers to which the server S1 has
   propagated request R1 have sent the recipient reply
   messages containing pointers to cluster trees.
   (c) Merge together the cluster tree created in step 5(a) and
   the cluster trees supplied in step 5(b), by sending any
   server (such as S1 itself) a message requesting such a
   merge, as described above.
   (d) Upon receiving a reply to the message sent in (c),
   which reply includes a pointer to a file representing the
   merged cluster tree, forward this reply to the sender of
   request R1, unless this is S1 itself.
5. Server S sends itself a global request message that causes
all servers in MT_full to act on embedded request R1.
6. Server S receives a reply to the message it sent in 5(c).
This reply includes a pointer to a file F that represents the
completed hierarchical cluster tree. Server S multicasts file F
to all proxy servers in MT_full.

Once the hierarchical cluster tree has been created as above,
server S can send additional messages through the cluster
tree, to arrange that multicast trees MT(C) are created for
sufficiently large clusters C, and that each file F is multicast
to the tree MT(C), where C is the smallest cluster containing
file F.

MATCHING USERS FOR VIRTUAL
COMMUNITIES

Virtual Communities

Computer users frequently join other users for discussions
on computer bulletin boards, newsgroups, mailing lists, and
real-time chat sessions over the computer network, which
may be typed (as with Internet Relay Chat (IRC)), spoken
(as with Internet phone), or videoconferenced. These forums
are herein termed "virtual communities." In current practice,
each virtual community has a specified topic, and users
discover communities of interest by word of mouth or by
examining a long list of communities (typically hundreds or
thousands). The users then must decide for themselves
which of thousands of messages they find interesting from
among those posted to the selected virtual communities, that
is, made publicly available to members of those
communities. If they desire, they may also write additional messages
and post them to the virtual communities of their choice. The
existence of thousands of Internet bulletin boards (also
termed newsgroups) and countless more Internet mailing
lists and private bulletin board services (BBS's)
demonstrates the very strong interest among members of the
electronic community in forums for the discussion of ideas
about almost any subject imaginable. Presently, virtual
community creation proceeds in a haphazard form, usually
instigated by a single individual who decides that a topic is
worthy of discussion. There are protocols on the Internet for
voting to determine whether a newsgroup should be created,
but there is a large hierarchy of newsgroups (which begin
with the prefix "alt.") that do not follow this protocol.

The system for customized electronic identification of
desirable objects described herein can of course function as
a browser for bulletin boards, where target objects are taken
to be bulletin boards, or subtopics of bulletin boards, and
each target profile is the cluster profile for a cluster of
documents posted on some bulletin board. Thus, a user can
locate bulletin boards of interest by all the navigational
techniques described above, including browsing and
querying. However, this method only serves to locate existing
virtual communities. Because people have varied and
varying complex interests, it is desirable to automatically locate
groups of people with common interests in order to form
virtual communities. The Virtual Community Service (VCS)
described below is a network-based agent that seeks out
users of a network with common interests, dynamically
creates bulletin boards or electronic mailing lists for those
users, and introduces them to each other electronically via
e-mail. It is useful to note that once virtual communities
have been created by VCS, the other browsing and filtering
technologies described above can subsequently be used to
help a user locate particular virtual communities (whether
pre-existing or automatically generated by VCS); similarly,
since the messages sent to a given virtual community may
vary in interest and urgency for a user who has joined that
community, these browsing and filtering technologies (such
as the e-mail filter) can also be used to alert the user to urgent
messages and to screen out uninteresting ones.

The functions of the Virtual Community Service are
general functions that could be implemented on any
network ranging from an office network in a small company to
the World Wide Web or the Internet. The four main steps in
the procedure are: 1. Scan postings to existing virtual
communities. 2. Identify groups of users with common
interests. 3. Match users with virtual communities, creating
new virtual communities when necessary. 4. Continue to
enroll additional users in the existing virtual communities.

More generally, users may post messages to virtual com- 
munities pseudonymously, even employing different pseud- 



scanned and profiled in the above step, based on the simi- 
larity of those messages computed target profiles, thus 
automatically finding threads of discussion that show com- 
mon interests among the users. Naturally, discussions in a 
single virtual community tend to show common interests; 
however, this method uses all the texts from every available 
virtual community, including bulletin boards and electronic 
mailing lists. Indeed, a user who wishes to initiate or join a 
discussion on some topic may send a "feeler message" on 



onyms for different virtual communities. (Posts not employ- lO that topic to a special mailing list designated for feeler mess 
ing a pseudonymous mix path may, as usual, be considered ages; as a consequence of the scanning procedure described 
to be posts employing a non-secure pseudonym, namely the above, the feeler message is automatically grouped with any 
user's tme network address.) Therefore, the above steps may similarly profiled messages that have been sent to this 
be expressed more generally as follows: 1. Scan pseudony- special mailing Ust, to topical mailing lists, or to topical 
mous postings to existing virtual communities. 2. Identify 15 bulletin boards. The clustering step employs "soft 
groups of pseudonyms whose associated users have com- clustering," in which a message may belong to multiple 
mon interests. 3. Match pseudonymous users with virtual cltisters and hence to multiple virtual communities. Each 
communities, creating new virtual communities when nec- cluster of messages that is found by Virtual Community 
essary. 4. Continue to enroll additional pseudonymous users Service and that is: of sufficient size (for example, 10-20 
in the existing virtual communities. Each of these steps can 20 different messages) determines a pre-community whose 



members are the pseudonymous authors and recipients of 
the messages in the cluster. More precisely, the pre- 
community consists of the various pseudonyms under which 
the messages in the cluster were sent and received. 

Alternative methods for determining a pre-community, 
which do not require the scanning step above, include the 
following: 1 , Pre -communities can be generated by grouping 
together users who have similar interests of any sort, not 
merely Individuals who have akeady written or received 



be carried out as described below. 
Scanning 

Using the technology described above. Virtual Commu- 
nity Service constantly scans all the messages posted to all 

the newsgroups and electronic mailing lists on a given 25 
network, and constructs a target profile for each message 
found. The network can be the Internet, or a set of bulletin 
boards maintained by America Online, Prodigy, or 
CompuServe, or a smaller set of bulletin boards that might 

be local to a single organization, for example a large 30 messages about similar topics. If the user profile associated 

company, a law firm, or a university. The scanning activity with each pseudonym indicates the user's interests, for 

need not be confined to bulletin boards and mailing lists that example through an associative attribute that indicates the 

were created by Mrtual Community Service, but may also be documents or Web sites a user likes, then pseudonyms can 

used to scan the activity of communities that predate Virtual be clustered based on the similarity of their associated user 

Community Service or are otherwise created by means 35 profiles, and each of the resxilting clusters of pseudonyms 

outside the Virtual Community Service system, provided determines a pre-community comprising the pseudonyms in 

that these communities are public or otherwise grant their the cluster. 2. If each pseudonym has an associated search 

permission. profile set formed through participation in the news clipping 

The target profile of each message includes textual service described above, then all search profiles of all 

attributes specifying the title and body text of the message. 40 pseudonymous users can be clustered based on their 

In the case of a spoken rather than written message, the latter similarity, and each cluster of search profiles determines a 

attribute may be computed from the acoustic speech data by pre-community whose members are the pseudonyms from 

using a speech recognition system. The target profile also whose search profile sets the search profiles in the cluster are 

includes an associative attribute listing the author(s) and drawn. Such groups of people have been reading about the 

designated recipient(s) of the message, where the recipients 45 same topic (or, more generally, accessing similar target 

may be individuals and/or entire virtual communities; if this objects) and so presumably share an interest: 3. If users 

attribute is highly weighted, then the system tends to regard participate in a news clipping service or any other filtering 

messages among the same set of people as being similar or or browsing system for target objects, then an individual 

related, even if the topical similarity of the messages is not user can pseudonymously request the formation of a virtual 

clear from their content, as may happen when some of the 50 community to discuss a particular cluster of one or more 



messages are very short. Other important attributes include 
the fraction of the message that consists of quoted material 
from previous messages, as well as attributes that are 
generally useful in characterizing documents, such as the 
message's date, length, and reading level. 
Virtual Community Identification 

Next, N^rtual Community Service attempts to identify 
groups of pseudonymous users with common interests. 
These groups, herein termed "pre-communities," are repre- 



target objects known to that system. This cluster of target 
objects determines a pre-community consisting of the pseud- 
onyms of users determined to be most interested in that 
cluster (for example; users who have search profiles similar 
55 to the cluster pro file), together with the pseudonym of the 
user who requested formation of the virtual community. 
Matching Users with Communities 

Once Virtual Community Service identifies a cluster C of 
messages, users, search profiles, or target objects that deter- 



sented as sets of pseudonyms. Whenever Virtual Community 60 mines a pre-community M, it attempts to arrange for the 



Service identifies a pre-community, it will subsequently 
attempt to put the users in said pre-community in contact 
with each other, as described below. Each pre-community is 
said to be "determined" by a cluster of messages, pseud- 
onymous users, search profiles, or target objects. 

In the usual method for determining pre- communities, 
Virtual Community Service clusters the messages that were 



members of this pre -cormn unity to have the chance to 
participate in a common virtual community V In many 
cases, an existing virtual community V may suit the needs of 
the pre-community M. Virtual Community Service first 
65 attempts to find such an existing community V. In the case 
where cluster C is a cluster of messages, V may be chosen 
to be any existing virtual community such that the cluster 
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profile of cluster C is within a threshold distance of the 
mean: profile of the set of messages recently posted to 
virtual community V; in the case where cluster C is a cluster 
of users, V may be chosen to be any existing virtual 
community such that the cluster profile of cluster C is within 5 
a threshold distance of the mean user profile of the active 
members of virtual community V; in the case where the 
cluster C is a cluster of search profiles, V may be chosen to 
be any existing virtual community such that the cluster 
profile of cluster C is within a threshold distance of the 10 
cluster profile of the largest cluster resulting from clustering 
all the search profiles of active members of virtual commu- 
nity V; and in the case where the cluster C is a cluster of one 
or more target objects chosen from a separate browsing or 
filtering system, V may be chosen to be any existing virtual 15 
community initiated in the same way from a cluster whose 
cluster profile in that other system is within a threshold 
distance of the cluster profile of cluster C. The threshold 
distance used in each case is optionally dependent on the 
cluster variance or cluster diameter of the profile sets whose 20 
means are being compared. 
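The message-cluster case of this matching rule can be illustrated with a minimal sketch. It assumes profiles are equal-length numeric vectors and that "distance" is Euclidean; neither is fixed by the text. The text allows any community within the threshold to be chosen, so picking the closest one is also an assumption, and the function names are hypothetical.

```python
import math

def distance(p, q):
    """Euclidean distance between two equal-length numeric profiles (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean_profile(profiles):
    """Component-wise mean of a non-empty list of equal-length profiles."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]

def find_existing_community(cluster_profile, recent_by_community, threshold):
    """Return the name of the closest community whose mean recent-message
    profile lies within `threshold` of the cluster profile, or None."""
    best_name, best_dist = None, None
    for name, profiles in recent_by_community.items():
        if not profiles:
            continue  # a community with no recent messages has no mean profile
        d = distance(cluster_profile, mean_profile(profiles))
        if d <= threshold and (best_dist is None or d < best_dist):
            best_name, best_dist = name, d
    return best_name

# Hypothetical data: two communities, each with recently posted message profiles.
recent = {"gardening": [[1.0, 0.0], [0.8, 0.2]], "chess": [[0.0, 1.0]]}
match = find_existing_community([0.9, 0.1], recent, threshold=0.3)
```

A return value of None corresponds to the situation in which a new virtual community would have to be created.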

If no existing virtual community V meets these conditions
and is also willing to accept all the users in pre-community
M as new members, then Virtual Community Service
attempts to create a new virtual community V. Regardless of
whether virtual community V is an existing community or a
newly created community, Virtual Community Service
sends an e-mail message to each pseudonym P in
pre-community M whose associated user U does not already
belong to virtual community V (under pseudonym P) and
has not previously turned down a request to join virtual
community V. The e-mail message informs user U of the
existence of virtual community V, and provides instructions
which user U may follow in order to join virtual community
V if desired; these instructions vary depending on whether
virtual community V is an existing community or a new
community. The message includes a credential, granted to
pseudonym P, which credential must be presented by user U
upon joining the virtual community V, as proof that user U
was actually invited to join. If user U wishes to join virtual
community V under a different pseudonym Q, user U may
first transfer the credential from pseudonym P to pseudonym
Q, as described above. The e-mail message further provides
an indication of the common interests of the community, for
example by including a list of titles of messages recently
sent to the community, or a charter or introductory message
provided by the community (if available), or a label
generated by the methods described above that identifies the
content of the cluster of messages, user profiles, search
profiles, or target objects that was used to identify the
pre-community M.
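The selection of invitation recipients can be sketched as follows. This is only an illustration: the function and field names are hypothetical, and the random token merely stands in for the credential, whose actual (cryptographic) form is described elsewhere in the specification and is not reproduced here.

```python
import secrets

def build_invitations(pre_community, members, declined, community_name):
    """Return {pseudonym: invitation} for each pseudonym in the pre-community
    that is not already a member and has not previously declined."""
    invitations = {}
    for pseudonym in pre_community:
        if pseudonym in members or pseudonym in declined:
            continue
        invitations[pseudonym] = {
            "community": community_name,
            # Placeholder credential (assumption): a random 128-bit token.
            "credential": secrets.token_hex(16),
        }
    return invitations

invites = build_invitations(["p1", "p2", "p3"], members={"p2"},
                            declined={"p3"}, community_name="V")
```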

If Virtual Community Service must create a new community V,
several methods are available for enabling the
members of the new community to communicate with each
other. If the pre-community M is large, for example
containing more than 50 users, then Virtual Community Service
typically establishes either a multicast tree, as described
below, or a widely-distributed bulletin board, assigning a
name to the new bulletin board. If the pre-community M has
fewer members, for example 2-50, Virtual Community
Service typically establishes either a multicast tree, as
described below, or an e-mail mailing list. If the new virtual
community V was determined by a cluster of messages, then
Virtual Community Service kicks off the discussion by
distributing these messages to all members of virtual
community V. In addition to bulletin boards and mailing lists,
alternative forums that can be created and in which virtual
communities can gather include real-time typed or spoken
conversations (or engagement in distributed multi-user
applications including video games) over the computer
network and physical meetings, any of which can be
scheduled by a partly automated process wherein Virtual
Community Service requests meeting time preferences from all
members of the pre-community M and then notifies these
individuals of an appropriate meeting time.
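As a small illustration of the size-based choice described above, the sketch below picks a forum type from the pre-community size. The function name and the `prefer_multicast` flag are hypothetical; the text says only what is "typically" established and does not say how the choice between a multicast tree and the other option is made.

```python
def choose_forum(member_count, prefer_multicast=False, large_threshold=50):
    """Pick a forum type for a new virtual community from its size.

    Above `large_threshold` members: bulletin board; from 2 up to the
    threshold: e-mail mailing list; a multicast tree is an alternative
    in both cases (selected here by an assumed flag)."""
    if member_count < 2:
        raise ValueError("a pre-community needs at least two members")
    if prefer_multicast:
        return "multicast tree"
    if member_count > large_threshold:
        return "bulletin board"
    return "mailing list"
```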
Continued Enrollment 

Even after creation of a new virtual community, Virtual
Community Service continues to scan other virtual
communities for new messages whose target profiles are similar
to the community's cluster profile (average message profile).
Copies of any such messages are sent to the new virtual
community, and the pseudonymous authors of these
messages, as well as users who show high interest in reading
such messages, are informed by Virtual Community Service
(as for pre-community members, above) that they may want
to join the community. Each such user can then decide
whether or not to join the community. In the case of Internet
Relay Chat (IRC), if the target profile of messages in a
real-time dialog is (or becomes) similar to that of a user, VCS
may also send an urgent e-mail message to such user,
whereby the user may be automatically notified as soon as
the dialog appears, if desired.
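A minimal sketch of this continued-enrollment scan follows. It assumes profiles are equal-length numeric vectors, uses cosine similarity as the similarity measure, and uses a fixed cutoff; all three are assumptions, as are the function names. Only message authors are considered as invitees here; interested readers are left out for brevity.

```python
import math

def cosine_similarity(p, q):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def enrollment_candidates(community_profiles, new_messages, members,
                          min_similarity=0.8):
    """Return (messages to forward, pseudonyms to invite).

    `new_messages` is a list of (author_pseudonym, profile) pairs found in
    other communities; the cluster profile is the average message profile."""
    n = len(community_profiles)
    centroid = [sum(column) / n for column in zip(*community_profiles)]
    forwarded, invitees = [], []
    for author, profile in new_messages:
        if cosine_similarity(profile, centroid) >= min_similarity:
            forwarded.append((author, profile))
            if author not in members and author not in invitees:
                invitees.append(author)
    return forwarded, invitees

# Hypothetical data: p3 is already a member, p2 writes about something else.
forwarded, invitees = enrollment_candidates(
    [[1.0, 0.0], [1.0, 0.2]],
    [("p1", [0.9, 0.1]), ("p2", [0.0, 1.0]), ("p3", [1.0, 0.0])],
    members={"p3"},
)
```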

With these facilities, Virtual Community Service provides
automatic creation of new virtual communities in any local
or wide-area network, as well as maintenance of all virtual
communities on the network, including those not created by
Virtual Community Service. The core technology
underlying Virtual Community Service is a search and
clustering mechanism that can find articles that are "similar"
in that the users share interests. This is precisely what was
described above. One must be sure that Virtual Community
Service does not bombard users with notices about
communities in which they have no real interest. On a very small
network a human could be "in the loop", scanning proposed
virtual communities and perhaps even giving them names.
But on larger networks Virtual Community Service has to
run in fully automatic mode, since it is likely to find a large
number of virtual communities.
Delivering Messages to a Virtual Community

Once a virtual community has been identified, it is 
straightforward for Virtual Community Service to establish 
a mailing list so that any member of the virtual community 
may distribute e-mail to all other members. Another method 
of distribution is to use a conventional network bulletin 
board or newsgroup to distribute the messages to all servers 
in the network, where they can be accessed by any member 
of the virtual community. However, these simple methods do 
not take into account cost and performance advantages 
which accrue from optimizing the construction of a multicast
tree to carry messages to the virtual community. Unlike
a newsgroup, a multicast tree distributes messages to only a 
selected set of servers, and unlike an e-mail mailing list, it 
does so efficiently. 

A separate multicast tree MT(V) is maintained for each
virtual community V, by use of the following four
procedures. 1. To construct or reconstruct this multicast tree, the
core servers for virtual community V are taken to be those
proxy servers that serve at least one pseudonymous member
of virtual community V. Then the multicast tree MT(V)
is established via steps 4-6 in the section "Multicast Tree
Construction Procedure" above. 2. When a new user joins
virtual community V, which is an existing virtual
community, the user sends a message to the user's proxy
server S. If user's proxy server S is not already a core server
for V, then it is designated as a core server and is added to
the multicast tree MT(V), as follows. If more than k servers
have been added since the last time the multicast tree MT(V)
was rebuilt, where k is a function of the number of core
servers already in the tree, then the entire tree is simply
rebuilt via steps 4-6 in the section "Multicast Tree
Construction Procedure" above. Otherwise, server S retrieves its
locally stored list of nearby core servers for V, and chooses
a server S1. Server S sends a control message to S1,
indicating that it would like to be added to the multicast tree
MT(V). Upon receipt of this message, server S1 retrieves its
locally stored subtree G1 of MT(V), and forms a new graph
G from G1 by removing all degree-1 vertices other than S1
itself. Server S1 transmits graph G to server S, which stores
it as its locally stored subtree of MT(V). Finally, server S
sends a message to itself and to all servers that are vertices
of graph G, instructing these servers to modify their locally
stored subtrees of MT(V) by adding S as a vertex and adding
an edge between S1 and S. 3. When a user at a client q
wishes to send a message F to virtual community V, client
q embeds message F in a request R instructing the recipient
to store message F locally, for a limited time, for access by
members of virtual community V. Request R includes a
credential proving that the user is a member of virtual
community V or is otherwise entitled to post messages to
virtual community V (for example is not "black marked" by
that or other virtual community members). Client q then
broadcasts request R to all core servers in the multicast tree
MT(V), by means of a global request message transmitted to
the user's proxy server as described above. The core servers
satisfy request R, provided that they can verify the included
credential. 4. In order to retrieve a particular message sent to
virtual community V, a user U at client q initiates the steps
described in section "Retrieving Files from a Multicast
Tree," above. If user U does not want to retrieve a particular
message, but rather wants to retrieve all new messages sent
to virtual community V, then user U pseudonymously
instructs its proxy server (which is a core server for V) to
send it all messages that were multicast to MT(V) after a
certain date. In either case, user U must provide a credential
proving user U to be a member of virtual community V, or
otherwise entitled to access messages on virtual community
V.
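The incremental join in procedure 2 can be sketched on a toy in-memory model. The sketch assumes each server's locally stored subtree is an adjacency dictionary and simulates message passing by direct updates; the data layout and function names are hypothetical, and the rebuild path (more than k additions) is not shown.

```python
def prune_leaves(subtree, keep):
    """Copy of `subtree` without its degree-1 vertices, except `keep`."""
    leaves = {v for v, nbrs in subtree.items() if len(nbrs) == 1 and v != keep}
    return {v: {n for n in nbrs if n not in leaves}
            for v, nbrs in subtree.items() if v not in leaves}

def add_server(subtree, new_server, attach_to):
    """Add `new_server` as a vertex joined by an edge to `attach_to` (in place)."""
    subtree.setdefault(new_server, set()).add(attach_to)
    subtree.setdefault(attach_to, set()).add(new_server)

def join_tree(local_subtrees, s, s1):
    """Server `s` joins the tree through nearby core server `s1`.

    `local_subtrees` maps a server name to its locally stored subtree."""
    g = prune_leaves(local_subtrees[s1], keep=s1)  # S1 forms G from G1
    local_subtrees[s] = g                          # S stores G as its subtree
    for server in set(g) | {s}:                    # S notifies itself and G's vertices
        if server in local_subtrees:               # only servers modeled in this toy
            add_server(local_subtrees[server], s, s1)

# Hypothetical example: S1 knows the path A - S1 - B - C.
trees = {"S1": {"S1": {"A", "B"}, "A": {"S1"}, "B": {"S1", "C"}, "C": {"B"}}}
join_tree(trees, "S", "S1")
```

In this example the leaves A and C are dropped from the graph handed to S, while S1 keeps them in its own stored subtree and additionally records the new edge to S.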

SUMMARY 

A method has been presented for automatically selecting
articles of interest to a user. The method generates sets of
search profiles for the users based on such attributes as the
relative frequency of occurrence of words in the articles read
by the users, and uses these search profiles to efficiently
identify future articles of interest. The method is
characterized by passive monitoring (users do not need to
explicitly rate the articles), multiple search profiles per user
(reflecting interest in multiple topics) and use of elements of
the search profiles which are automatically determined from
the data (notably, the TF/IDF measure based on word
frequencies and descriptions of purchasable items). A
method has also been presented for automatically generating
menus to allow users to locate and retrieve articles on topics
of interest. This method clusters articles based on their
similarity, as measured by the relative frequency of word
occurrences. Clusters are labeled either with article titles or
with key words extracted from the article. The method can
be applied to large sets of articles distributed over many
machines.
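As an illustration of profiles built from relative word frequencies, the sketch below computes a simple TF/IDF weighting and compares two profiles by cosine similarity. The particular weighting (term frequency times log of inverse document frequency), the whitespace tokenizer, and the function names are assumptions chosen for brevity; the specification's exact formulation may differ.

```python
import math
from collections import Counter

def tf_idf_profiles(documents):
    """Return one {word: weight} profile per document, with weight equal to
    (relative term frequency) * log(N / document frequency)."""
    tokenized = [doc.lower().split() for doc in documents]
    n = len(tokenized)
    doc_freq = Counter(word for words in tokenized for word in set(words))
    profiles = []
    for words in tokenized:
        counts = Counter(words)
        total = len(words)
        profiles.append({w: (c / total) * math.log(n / doc_freq[w])
                         for w, c in counts.items()})
    return profiles

def cosine(p, q):
    """Cosine similarity of two sparse {word: weight} profiles."""
    dot = sum(weight * q.get(word, 0.0) for word, weight in p.items())
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

# Hypothetical three-article collection.
docs = ["stock market rises", "stock market falls", "garden roses bloom"]
profiles = tf_idf_profiles(docs)
```

Words that occur in fewer articles receive higher weights, so the two market articles come out as similar to each other and dissimilar to the gardening article.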

It has been further shown how to extend the above 
methods from articles to any class of target objects for which 
profiles can be generated, including news articles, reference
or work articles, electronic mail, product or service 
descriptions, people (based on the articles they read, demo- 
graphic data, or the products they buy), and electronic 
bulletin boards (based on the articles posted to them). A 
particular consequence of being able to group people by 
their interests is that one can form virtual communities of 
people of common interest, who can then correspond with 
one another via electronic mail. 
I claim: 

1. A method for providing a user with a customized 
electronic newspaper that is accessible via an electronic data 
transmission media, where said users are connected via user
terminals and data communication connections on said 
electronic data transmission media to a server system which 
provides said user with access to a plurality of target object
news articles and advertisements, said method comprising 
the steps of: 

automatically generating separate target profiles for said 
plurality of target object news articles and advertise- 
ments that are accessible via said electronic data trans- 
mission media, each of said target profiles being gen- 
erated automatically, by a computer system running a 
profile generation algorithm, from the contents of an 
associated one of said target object news articles and 
advertisements; 

automatically generating at least one user target profile 
interest summary for a user at a user terminal, each said 
user target profile interest summary being generated 
from said target object profiles associated with said
news articles and advertisements accessed by said user; 

calculating numerical interest values between said at least 
one user target profile interest summary and said target 
profiles; and 

automatically creating a customized electronic newspaper 
for said user by presenting said user with a customized 
selection, as a function of said calculated numerical 
interest values, of said plurality of target object news 
articles and advertisements. 

2. The method of providing a user with a customized 
electronic newspaper of claim 1, further comprising the step 
of: 

automatically transmitting a notification to said user to 
identify newly received ones of said target object news 
articles and advertisements of interest to said user, as 
determined by at least one user target profile interest 
summary. 

3. The method of providing a user with a customized 
electronic newspaper of claim 2, wherein said step of 
automatically creating comprises: 

presenting to said user said newly received target object 
news articles in a rank order listing based upon the 
predicted level of interest by said user to said target 
object news articles according to matching criteria 
associated with said at least one user target profile 
interest summary. 

4. The method of providing a user with a customized 
electronic newspaper of claim 1, wherein said step of 
automatically creating comprises: 

dynamically creating a rank ordered listing on said server
of said customized selection of target object news 
articles and advertisements in accordance with at least 
one of: a predicted degree of interest of said user 
towards said target object news articles and 
advertisements, and electronic mailing lists for said 
user. 
5. The method of providing a user with a customized
electronic newspaper of claim 1 further comprising the step
of:

specifically identifying to said user said customized
selection of target object news articles in which said target
object news articles are accessible within an online
directory.

6. The method of providing a user with a customized
electronic newspaper of claim 1 further comprising the step
of:

specifically identifying to said user said customized
selection of target object news articles by high lighting hyper
links to said selected target object news articles.

7. The method of providing a user with a customized
electronic newspaper of claim 1 further comprising the step
of:

enabling said user to perform an on-line search of said
plurality of target object news articles via said
electronic transmission media; and

customizing results of said on-line search based upon a
predicted level of interest by said user to said target
object news articles within said on-line search results.

8. A system for providing a user with a customized
electronic newspaper that is accessible via an electronic data
transmission media, where said users are connected via user
terminals and data communication connections on said
electronic data transmission media to a server system which
provides said user with access to a plurality of target object
news articles and advertisements, said system comprising:

means for automatically generating separate target
profiles for said plurality of target object news articles and
advertisements that are accessible via said electronic
data transmission media, each of said target profiles
being generated automatically, by a computer system
running a profile generation algorithm, from the
contents of an associated one of said target object news
articles and advertisements;

means for automatically generating at least one user target
profile interest summary for a user at a user terminal,
each said user target profile interest summary being
generated from said target object profiles associated
with said news articles and advertisements accessed by
said user;

means for calculating numerical interest values between
said at least one user target profile interest summary
and said target profiles; and

means for automatically creating a customized electronic
newspaper for said user by presenting said user with a
customized selection, as a function of said numerical
interest values, of said plurality of target object news
articles and advertisements.

9. The system for providing a user with a customized
electronic newspaper of claim 8, further comprising, means
for automatically transmitting a notification to said user to
identify newly received ones of said target object news
articles and advertisements of interest to said user, as
determined by said at least one user target profile interest
summary.

10. The system for providing a user with a customized
electronic newspaper of claim 9, wherein said means for
automatically creating comprises:

means for presenting to said user said newly received
target object news articles in a rank order listing based
upon the predicted level of interest by said user to said
target object news articles according to matching
criteria associated with at least one user target profile
interest summary.

11. The system for providing a user with a customized
electronic newspaper of claim 8, wherein said means for
automatically creating comprises:

means for dynamically creating a rank ordered listing on
said server of said customized selection of target object
news articles and advertisements in accordance with at
least one of: a predicted degree of interest of said user
towards said target object news articles and
advertisements, and electronic mailing lists for said
user.

12. The system for providing a user with a customized
electronic newspaper of claim 8 further comprising:

means for specifically identifying to said user said
customized selection of target object news articles in
which of said target object news articles are accessible
within an online directory.

13. The system for providing a user with a customized
electronic newspaper claim 8 further comprising:

means for specially identifying to said user said
customized selection of target objects news articles by high
lighting hyper links to said selected target object news
articles.

14. The system for providing a user with a customized
electronic newspaper of claim 8 further comprising:

means for enabling said user to perform an on-line search
of said plurality of target object news articles via said
electronic transmission media; and

means for customizing results of said on-line search based
upon a predicted level of interest by said user to said
target object news articles within said on-line search
results.

15. A method for providing a user with access to selected
ones of a plurality of target object advertisements that are
accessible via an electronic data transmission media, where
said users are connected via user terminals and data
communication connections on said electronic data transmission
media to a server system which provides said user with
access to a plurality of bulletin boards, said method
comprising the steps of:

automatically generating target object profiles for target
object advertisements that are accessible by said
electronic data transmission media, each of said target
object profiles being generated automatically, by a
computer system running a profile generation
algorithm, from the contents of an associated one of
said target object advertisements;

automatically generating at least one user target profile
summary for a user at a user terminal, each said user
target profile interest summary being generated from
said plurality of target object profiles associated with
said target object advertisements accessed by said user;

calculating numerical interest values between said at least
one user target profile interest summary and said target
object profiles; and

generating a customized selection, as a function of said
numerical interest values, of said plurality of target
object advertisements.

16. The method of providing a user with a customized
electronic newspaper of claim 15 wherein said user is a
consumer of an off-line newspaper, further comprising the
step of:

delivering said selected ones of said plurality of
advertisements to said user via an inserted piece of printed
material in said off-line newspaper.

17. The method of providing a user with a customized
electronic newspaper of claim 15, wherein said step of
automatically creating comprises:

dynamically creating a rank ordered listing on said server
of said customized selection of target object
advertisements in accordance with at least one of: a predicted
degree of interest of said user towards said
advertisements, and electronic mailing lists for said
user.

18. A system for providing a user with access to selected 
ones of a plurality of target object advertisements that are 
accessible via an electronic data transmission media, where 
said users are connected via user terminals and data com- 
munication connections on said electronic data transmission 
media server system which provides said user with access to 
a plurality of bulletin boards, said method comprising: 
means for automatically generating target object profiles 
for target object advertisements that are accessible by 
said electronic data transmission media, each of said 
target object profiles being generated automatically, by 
a computer system running a profile generation 
algorithm, from the contents of an associated one of 
said target object advertisements; 

means for automatically generating at least one of said 
target object advertisements; 

means for automatically generating at least one user target 
profile interest summary for a user at a user terminal, 
each said user target profile interest summary being 
generated from said plurality of target object profiles 
associated with said target object advertisements
accessed by said user; 
means for calculating numerical interest values between 
said at least one user target profile interest summary 
and said target object profiles; and 

means for generating a customized selection, as a function 
of said numerical interest values, of said plurality of 
target object advertisements. 

19. The system for providing a user with a customized
electronic newspaper of claim 18, wherein said user is a 
consumer of an off-line newspaper, further comprising: 

means for delivering said selected ones of said plurality of 
advertisements to said user via an inserted piece of 
printed material in said off-line newspaper.

20. The system for providing a user with a customized 
electronic newspaper of claim 18, wherein said means for 
automatically creating comprises: 

means for dynamically creating a rank ordered listing on 
said server of said customized selection of target object 
advertisements in accordance with a least one of: a 
predicted degree of interest of said user towards said 
advertisements, and electronic mailing lists for said 
user. 